This report critically analyses the trends in enrolments for senior secondary (Year 11 and Year 12) mathematics and science subjects in Queensland. Initially, the dataset will be described, from how the dataset is obtained to how the raw data was restructured, cleaned and augmented to facilitate the analysis. This includes imputing missing values, using regular expressions to manipulate character strings and relational joins to combine datasets with matching observations. Next, exploratory data analysis will be conducted using modern data exploration tools to uncover the underlying structures and interesting features of enrolment in all subjects; This will consist of graphical displays such as time plots to better understand the dataset.
The aforementioned measures taken above gives readers an intuition of the dataset, the report will subsequently focus on building a statistical model to predict enrolments for senior secondary mathematics and science subjects. As the dataset is longitudinal, that is, involving a collection of data (enrolments) from the same school across multiple time points, a multilevel model is deemed appropriate for this analysis. This section of the report will further elaborate the methodology in producing the most appropriate model and the predictions from the ‘best’ model.
The following research questions guided the report:
| Variable | Description |
|---|---|
| Enrolment information | |
| completion_year | Graduation cohort’s year of completion |
| year_11_enrolments | Number of Year 11 students enrolled in the subject |
| year_12_enrolments | Number of Year 12 students enrolled in the subject |
| Subject information | |
| qcaa_subject_id | QCAA subject ID |
| subject_name | Subject name |
| subject | Mathematics or Science subject |
| subject_type | Type of subject (General, applied or short courses) |
| School information | |
| qcaa_school_id | QCAA school ID |
| school_name | School Name |
| doe_centre_code | Department of Education Centre Code |
| qcaa_district | School district |
| school_postcode | School postcode |
| sector | School sector (Government, Independent, Catholic) |
The dataset used for this analysis encompasses enrolment information for all participating schools in the QCE system, and can be extracted in the Queensland Curriculum and Assessment Authority (QCAA) website, where the enrolments data for the old QCE system (enrolments spanning from 1992 to 2019) can be found under the “Statistics before 2020” toggle (The State of Queensland (Queensland Curriculum & Assessment Authority) (2019b)) while the enrolments data for the new QCE system can be found under the “Statistics from 2020” toggle.
Each row (observation) corresponds to a school’s enrolment information for a subject in a given year, in addition to the school and subject details. The dataset can be broken down into the three main components, as seen above (Table 1). The data above has been cleaned and is ready for the format that allows for analysis, the further sections will describe the data wrangling process in details.
Senior schooling is a pivotal moment for Queensland students as most students transition into tertiary education. The new QCE system was introduced from Year 11 students in 2019, where Overall Position (OP) scores will be replaced by the new Australian Tertiary Admission Rank (ATAR) system in assisting universities in selecting applications in their courses. For this reason, the Queensland Core Skills (QCS) test for year 12 students, which has been in place since 1992 is abolished (after Year 12 in 2019) for the new ATAR system. The ATAR is a inter subject percentile-rank (as opposed to a mark), where students are ranked relative to their age group; To be specific, ATAR is expressed on a 2,000 point scale from 99.95 down to 0.00 in steps of 0.05.
Queensland ATARs are based on a student’s:
In the ATAR calculation, only one applied subject can be included in the ATAR calculation. For instance, if a student enrols in Essential Mathematics (applied subject) and Science in Practice (applied subject), only the higher of both
In light of the new system, a new suite of senior subjects are reviewed and redeveloped, along with new ones introduced. The new senior syllabuses include general (extension) subjects for students who wish to pursue tertiary pathways, vocational education and training, and work. These subjects will have four summative assessments — three internal and one external examination at the end of Year 12, which are set and marked by independent teacher assessors accredited by the Queensland Curriculum and Assessment Authority. On the other hand, applied (essential) subjects are mainly for students interested in pathways beyond school that primarily lead to vocational education and training or directly to work. These subjects tend to be more ‘real world’ focus and do not have external examinations but have a total of four summative internal assessments.
Notably, this report will not cover short courses, which are suited for students who wants to establish a basis for further education or work, typically leading to further vocational education and training.
| Old subject name | New subject name |
|---|---|
| Applied Mathematics | |
| — | Essential Mathematics |
| General Mathematics | |
| Mathematics A | General Mathematics |
| Mathematics B | Mathematical Methods |
| Mathematics C | Specialist Mathematics |
There are three main mathematics subjects in the new QCE system, which are broadly equivalent to the Mathematics A, B and C in the old QCE system (Table 2). Of the general mathematics subjects, General Mathematics is the least taxing, followed by Mathematical Methods and Specialist Mathematics. Furthermore, Essential Mathematics (applied subject) is introduced in the new system, where it focuses on developing and using mathematical knowledge to investigate real-world problems such as managing money, travel and data, and decision-making using mathematical concept (The State of Queensland (Queensland Curriculum & Assessment Authority) (2019a)). Each school may offer various mathematics programs — e.g. offer two of the three general mathematics subjects, or only offer Essential Mathematics (applied subjects).
Mathematical Methods is often a prerequisite for some university courses, implying that students completing the subject can streamline entry to specific courses in the university. Specialist Mathematics is designed to be taken in conjunction, or on completion of Mathematical methods, as it extends the key ideas studied in that subject. This path is usually recommended for students who wish to undertake further study in mathematics — such as engineering or science related courses or have strong aptitude in mathematics.
As an alternative, there is a safety net as students may opt for a less challenging mathematics subject such as Essential Mathematics subject (applied subject), as it would also contribute to the student’s ATAR score. This subject is usually undertaken by students who do not require mathematics for tertiary education at university of have not been able to attain a pass in Year 10 Core Mathematics. However, it should be noted that only general mathematics subjects or applied mathematics subjects can included in the ATAR, but not both. For instance, it is not possible to incorporate Mathematical Methods (general subject) and Essential Mathematics (applied subject) in a student’s ATAR.
| Old subject name | New subject name |
|---|---|
| Applied Sciences | |
| Agricultural Practices | Agricultural Practices |
| Aquatic Practices | Aquatic Practices |
| Science in Practice | Science in Practice |
| General Sciences | |
| Agricultural Science | Agricultural Science |
| Biology | Biology |
| Chemistry | Chemistry |
| Earth Science | Earth & Environmental Science |
| Marine Science | Marine Science |
| Physics | Physics |
| — | Psychology |
Unlike mathematics subjects, there are little changes in the subject names. As demonstrated in Table 3 There are two main changes in science subjects — ‘Psychology’ is introduced in the new QCE system and Earth Science is renamed to Earth and Environmental Science.
To get the dataset ready for the analysis, the dataset is transformed to reproduce that of the new QCE system.
Changing all subject names to match new QCE system: To combine the datasets relating to the old and new QCE system, the subject names were first matched. As demonstrated in Tables 2 and 3, some subjects names were changed in the new QCE system. All subject names in the old system were changed to that of the new system to facilitate the relational join. For example, Mathematics A, B and C were modified to General Mathematics, Mathematical Methods and Specialist Mathematics respectively to match that of the new QCE system.
Changing unit 1 and 2, and unit 3 and 4 enrolments to year 11 and year 12 enrolments: Next, the new QCE system introduces an assessment program consisting of Units 1 and 2. In general, most students complete Units 1 and 2 in Year 11, and Units 3 and 4 are undertaken in year 12. Accordingly, units 1 and 2, and units 3 and 4 are renamed to year 11 enrolments and year 12 enrolments to facilitate the relational join of both datasets corresponding to the old and new QCE system. Once the school and subject information were aligned, both datasets were joined seamlessly.
| School Name (Old QCE system) | School Name (New QCE system) | Modified School Name |
|---|---|---|
| Fixed by removing parantheses | ||
| St Catherine’s Catholic College The Whitsundays (Proserpine (Renwick Road)) | St Catherine’s Catholic College The Whitsundays (Proserpine) | St Catherine’s Catholic College The Whitsundays |
| Lutheran Ormeau Rivers District School (Lords) | Lutheran Ormeau Rivers District School (Pimpama) | Lutheran Ormeau Rivers District School |
| Faith Baptist Christian School (Gladstone) | Faith Baptist Christian School (Burua) | Faith Baptist Christian School |
| Fixed by removing dashes | ||
| Rivermount College - Yatala | Rivermount College (Yatala) | Rivermount College |
| Carinity Education - Southside | Carinity Education - Southside Sunnybank | Carinity Education |
Regular expressions were utilised to remove school names that did not match in the datasets for the old and new QCE system. An extract of what has been done is shown above (Table 4). Fortunately, schools are uniquely identified with QCAA school ID, easing this process as incompatible school names can be easily identified. In most cases, removing parentheses “()” and dashes “-” relating to suburb solves the issues.
| School Name (Old QCE system) | School Name (New QCE system) | Issue (Why school names does not match) |
|---|---|---|
| Clairvaux Mackillop College | Clairvaux MacKillop College | Capital K in ‘MacKillop’ |
| St Aidan’s Anglican Girls’ School | St Aidan’s Anglican Girls School | Extra ’ in ‘Girls’ |
| Cloncurry State School P-12 | Cloncurry State School | Extra ‘P-12’ |
| Aboriginal & Islander Independent Community School | Aboriginal and Islander Independent Community School | ‘&’ used instead of ‘and’ |
However, there were cases in which school names have to be recoded manually as the school names were inconsistent. Some of the examples are demonstrated in Table 5.
Figure 1: At-a-glance plot of the missing values in the raw dataset, where the pink cells indicate a missing value. Most observations are present, except for three variables. The missing values in year 11 and 12 enrolments relates to schools which offered a subject for year 11 but not for year 12 and vice versa
Figure 1 illustrates that most of the observations were present, except for there variables relating to enrolments (year_11_enrolments and year_12_enrolments) and the Department of Education Centre Code (doe_centre_code).
Zero enrolments in Year 11
First, 67.13% of the missing values in year 11 enrolments relate to the 2019 graduating cohort that were part of the Queensland’s non-compulsory first Prep Year cohort in 2007. This cohort commenced in 2007 at the current time and will graduate from year 12 in 2019. Therefore, from 2020, every year will be a full cohort of students, implying an additional group of students each year and thus higher expected enrolment for the new QCE system. As this 2007 prep year cohort leaves the schooling system, the enrolments from 2020 will increase considerably as a full cohort will be realised, as will be seen in the later sections.
Zero enrolments in Year 12 Next, 71.48% of the zero enrolments in year 12 corresponds to the first year in which a given school introduces a subject. As year 12 enrolments requires students to complete year 11 syllabus, most students may not partake in the subject when the schools first introduce the subject. In further scrutiny, the other schools that have zero enrolments are usually smaller schools (in rural areas) or schools may only offer a subject for year 11 students but not year 12 students in certain years and vice versa. For example, Mossman State High School offered Specialist Mathematics subject in some years — Offered the subject to Year 11 students of 2020 and Year 12 students of 2021, but not for Year 11 students of 2021 and Year 12 students of 2022 (Mossman State High School (2019)).
Missing Department of Education Centre Code
Schools that were missing the Department of Education Centre Code refers to schools that existed before 2002. This alludes that these schools may not have been assigned any Department of Education Centre Code. Fortunately, the dataset provides qcaa_school_id which have no missing values and can uniquely identify each school, and thus, thee missing values in doe_centre_code may be disregarded.
Errors in school postcodes usually arise form school postcodes codes being reassigned overtime, due to typographical errors, mainly for schools with two or more different locations. These issues were fixed manually by cross checking the correct addresses for the school provided by Google Maps.
Figure 2: Total number of participating schools for each subject per year
Figure 2 displays the total number of participating schools for a given year. Some subjects showed rather stable enrolment figures such as Biology and Chemistry, which showed a steady increase in participating schools over the years. Other subjects like Science in Practice and Earth and Environmental Science showed a rather erratic trend, where the first year of Earth and Environmental Science subject showed the largest number of participating schools while Science in Practice subject showed a one-off year (2016) in which the total schools was significantly greater than the other years.
Figure 3: Average enrolments for all schools for each year, enrolments for the new QCE system are coloured in a darker shade of grey
As mentioned, since the 2007 prep year cohort left the schooling system at the end of 2019, enrolments will be replaced by a full cohort. It is not surprising therefore that all subjects saw a surge in enrolments from 2019 to 2020. Based on a study by Independent Schools Queensland (Independent Schools Queensland (ISQ) (2020)), this resulted in approximately 6.2% (22,050) increase in secondary student enrolments from 2019 and 2020 in independent sectors alone.
Subjects were introduced at different time frames. Aquatic practices and Marine Science were only introduced in 2014 and 2015 respectively. There was a halt in Agricultural Practices subject in year 2000, before being re-introduced again in 2015. Furthermore, the inception of the new QCE system was two new subjects, Psychology and Essential Mathematics.
Figure 4: Time plot of enrolments of all the subjects across the years
Figure 4 displays a spaghetti plot of year 12 enrolments for each subject, with overall fit using the the mean of enrolments for all schools in a given year. Each line represents year 12 enrolments for a single school. For most subjects, some relatively large schools can be identified as their enrolments numbers were significantly higher than most other schools. The year 12 enrolments for these large schools were especially prominent in the later years, where the new QCE system was introduced.
Some subjects that were introduced in the 1990s such as Biology, Chemistry, Physics, General Mathematics and Mathematical Methods showed rather stable enrolment numbers as enrolment trend across schools appears to be proportional to one another. In other words, when there is a general increase (decrease) in enrolments, most schools will show an increase (decrease), as seen by the large cluster of observations. In contrary, Agricultural Earth and Environmental Science, although introduced in the 1990s, did not exhibit such patterns as the enrolment numbers were rather erratic. However, it can be observed that there were significantly less enrolments for these subjects, implying that not many schools offer this subject or the subject is not popular among students.
Figure 5: Year 12 enrolments agianst year 11 enrolments, with a 45° line to indicating a perfect linear relationship
A scatterplot of year 11 and year 12 are plotted in Figure 5, with a 45° line to facilitate the interpretation of the scatter plot, where it indicates a one unit increase in year 11 enrolments is matched with a one-unit increase in year 12 enrolments (i.e. perfect linear relationship). As already mentioned, the 0 enrolments in year 11 (as seen on the figure) is attributed to the Queensland’s first Prep Year cohort in 2007 while the 0 enrolments in year 12 relates to the year where a school introduces the subject for the very first time (thus, there will be no year 12 students as most students have yet to complete year 11).
Most subject demonstrates a linear relationship (lie on or close to the red line) between year 11 and year 12 students. Intuitively, this is reasonable as a student is likely to continue to complete the year 12 syllabus (unit 3 and 4) after completing his year 11 syllabus (unit 1 and 2).
Observations below the red line may suggest that students have dropped the subject or have taken a gap year. Students are usually recommended to enrol in 6 subjects for senior school (Art of Smart (2021)). A student usually starts with 6 subjects in year 11, before dropping down to 5 subjects in Year 12. However, completing 6 subjects gives the student a “safety net,” in the instance where the student did not perform well for a particular subject (noting that ATAR takes the best of 5 subjects). Taking this into account, students appears to be dropping Psychology subject, as Figure 5 demonstrates that most year 12 enrolments were below the red 45° line.
In contrary, observations that lie above the red line may indicate that these students may have completed the year 11 (unit 1 and 2) in year 10, and re-enrolled in year 12. Furthermore, students may opt to study at a school of distance education to complete their year 11 (unit 1 and 2) syllabus because of various reasons such as the school not offering year 11 syllabus for a given year.
Some science subjects such as Earth and Environmental Science, Marine science, and Science in Practice displays heteroscedasticity, where the larger the cohort size, the higher the rate of students dropping the subject or having more enrolments in year 12. Interestingly, these science subjects have a relatively small number of enrolments. This pattern does not seem to manifest in mathematics subjects.
Figure 6: Parallel Coordinate plot comparing year 11 and year 12 enrolments for a given cohort in a school. The blue (grey) lines represents an increase (decrease) in year 12 enrolments. In general, larger schools appears to have smaller year 12 enrolments, as indicated by numerous grey lines when enrolment numbers are higher
The parallel coordinates plot (Figure 6) allows comparison of the difference between year 11 and year 12 enrolments. The grey line indicates that year 12 enrolments is smaller than year 11 enrolments for a given cohort in a school and conversely, the blue lines represents an increase in enrolments from year 11 to year 12.
In general, larger schools (with relatively high enrolment numbers for a given subject) appears to have smaller year 12 enrolments as compared to year 11 enrolments. This pattern is especially noticeable in Psychology, where most schools with enrolments higher than 150 see a decrease in enrolment numbers from year 11 to year 12. This was not the case for some subjects such as General Mathematics, Chemistry, and Biology.
Figure 7: Distributions of enrolments for each subject shown with a density plot, overlaid with a density plot
The distribution of enrolments for each subject appears to be right skewed (Figure 7). This is possibly attributed to the size of schools, as some schools are significantly larger than others.
Figure 8: Box-plot of distribution of enrolments for each subject by sector
In most subjects, the distribution in enrolments across all sectors were rather similar (Figure 8). Subjects that were introduced early such as Biology, Chemistry, Specialist Mathematics, and Physics showed stable enrolment numbers, with little variation across the different sectors. As highlighted by the outliers in all subjects, there were schools that has significantly more enrolments, these observations may refer to the larger schools in Queensland. These outliers are considerably prominent, especially in Government schools.
At the outset, a standard linear regression model assumes that residuals are independently and identically distributed (\(i.i.d\)) – i.e. There will be no further correlations (dependence) between measures once predictors are added to the model. Substantively, this suggests that any higher level entities are identical, and can be completely ‘pooled’ into a single observation. With a longitudinal data, this is often an unreasonable assumption, as temporal data are often characterised by marked dependence over time and response for measurements across time are often related to one another. For instance, the division between the within (e.g. schools) and between (e.g. between clusters such as district) would be assumed to be equal such that a one-unit increase in enrolments for within effects would match a one unit increase in between-effects, which may be unlikely the case in this context.
Using the multilevel model allows for simultaneous modelling of both intra-school change (i.e. how school enrolments are changing overtime) and interindividual change (i.e. Differences in temporal change across schools). As the dataset is imbalanced due to schools introducing or ceasing a given subject at different times, the multilevel model is appropriate as it is able to utilise available data from incomplete observations without reducing sample size.
In general, multilevel models takes on the general form: Repeated measures (level 1) nested within individuals (level 2) and possibly individuals nested within some higher-level clusters (level 3). With respect to the data, level 1 relates to a single measure in time, level 2 relates to the individual schools offering the maths and science subjects and level 3 corresponds to the schools nested within districts or postcodes. As our data is nested within schools and possibly in higher-level clusters, the multilevel model is appropriate.
The data is aggregated at the school-level, implying that more granular information such as the student-level is not available. With the new QCE system, students are allowed to choose from a range of courses, and not all will pursue a mathematics and science pathway. This serves as a disadvantage as information such as proportions cannot be computed. Ideally, the number of students pursuing some STEM programs may be extracted, such that the proportion of students who are enrolled in a particular STEM subject and is pursuing a STEM career pathway can be computed.
Some subjects (e.g. Essential Mathematics and Psychology) are newly offered in the new QCE system. This insinuates that there are only observations spanning from 2020 to 2022 for these subjects. Although the multilevel model is efficient in using the available data (even without complete observations), it has its own drawbacks, which will be discussed in the models of each subject.
As aforementioned, the objective of this report is to predict enrolments in light of the new QCE system.
A basic linear model within schools provide an at-a-glance visualisation of the enrolment trends within each school. An advantage of this visualisation is that the linear model summarises enrolment trends with only two summary statistics, an intercept and a slope, which is useful for panel data as enrolments can be viewed overtime per school. The intercept conveys a school’s enrolments when the subject was first introduced to the school while the slope indicates the school’s yearly average increase in enrolments over the period in which the subject is or was running.
Each subject encounters different circumstances (e.g. new subjects introduced only in 2020), the following steps are the main steps taken to get the dataset for a single subject ready for modelling. However, it should be noted that in some subjects, more or less steps could be taken given the context.
Figure 9: Year 11 enrolments, with a vertical line indicating year 2019, where enrolments in most schools dropped to zero due to the 2007 prep year cohort.
Subjects with 0 enrolments for a given cohort will be removed as these enrolment figures will artificially inflate/deflate enrolment numbers. As outlined, most of the zero enrolments in 2019 was due to the completion of the 2007 prep year cohort. These zero enrolments can be seen in Figure 9, where the vertical red line indicates the year 2019.
By the same token, most year 12 subjects with with zero enrolments relates to the first year schools introduced the subject. In this year, most students would be ineligible to enrol in year 12 units as most year 12 units requires the completion of year 11 units. Furthermore, the other zero enrolments were due to small schools who had little to no enrolments in a particular year.
enrolments using log transformationAs seen in the density plots for each subject (Figure 7), enrolment numbers are right skewed due to the various school sizes in Queensland. This means that most of the variance captured in the dataset can be attributed to the various school sizes in Queensland. As such, a log transformation will be utilised to allow models to be estimated by the linear mixed models. A log transformation is appropriate in this context as enrolments are by nature, a positive number.
completion_year)It is often helpful to rescale time variable so the first measurement occasion is the zero point. In this context, the completion_year for each subject will be rescaled to 0, where 0 corresponds to the first year in which the subject was introduced. For instance, Specialist Mathematics subject was introduced in 1992, and thus, will be the zero point (variable named year92). Rescaling time variable gives the intercept the interpretation of the baseline, or initial status of the dependent variable (i.e. 0 as the commencement of the subject)
Figure 10: Nested structure of the dataset
The multilevel model was primarily used because of the underlying nested structure in the dataset; A nested structure occurs when individual data points as one level (i.e. schools in this context) appear in only one level of a higher-level variable (e.g. school postcodes or QCAA district). With a longitudinal data, time predictors will be associated with a single measurement in time (i.e. first year in which a school introduces the subject).
Accordingly, three potential models can be identified from the nesting structure above (Figure 10):
Presently, information criteria selection such as AIC dominate the model selection criteria in comparing non-nested multilevel model. This report will follow suit, by comparing the different non-stationary random effects using AIC to select the ‘best’ initial model.
In a multilevel context, it is often helpful to begin with an unconditional means (null) model where there are no predictors at any level, only an intercept. This model model provides useful information in understanding the structure of the data. It can be utilised to obtain estimates of residuals and intercept variance that allows for assessment of the variation at each level, and in comparing variability in (1) schools within clusters for a level three model or (2) within schools for a level three model. However, before delving into the specifics, all potential models have to be considered.
For instance, the intraclass correlation coefficient (ICC = \(\rho_i\)) can be computed. ICC refers to proportion of variance in the response (enrolments) that can be explained by the grouping structure of the hierarchical model; It is often easier to conceptualise it as the correlation between the enrolments of two schools, randomly selected from a cluster or as the same amount of variance explained by those groupings (similar to an \(R^2\)).
The three-level unconditional means model can be expressed as:
Level one (time point within schools) \[Y_{tij} = \pi_{0ij} + \epsilon_{tij}\]
Level two (schools within clusters) \[\pi_{0ij} = \beta_{00j} + u_{0ij}\]
Level three (clusters) \[\beta_{00j} = \gamma_{000} + r_{00j}\]
In this case, clusters (i.e. in this dataset, districts or postcodes) are considered to be independent, but schools within the same districts and the measurements at different time points for the same school are correlated. If a two-level multilevel model is used, level three can be disregarded.
At level one:
enrolments) of school \(i\) from cluster \(j\) at time \(t\)enrolments) of school \(i\) from cluster \(j\) across all time pointsenrolments (\(y_{tij}\)) and mean enrolments for school \(i\) in district \(j\)At level two:
enrolments for cluster \(j\) across all schools from the given (\(j^{th}\)) districtenrolments of school \(i\) from district \(j\) from the mean enrolments of all schools in district \(j\)At level three:
enrolments across all schools, districts and time pointsenrolments from all observations in district \(j\) and the overall mean enrolments across all districts, schools and time points.In combining level 1,2 and 3, the composite model can be obtained:
\[\begin{aligned} Y_{ij} & = \pi_{0ij} + \epsilon_{tij} \\ & = (\beta_{00j} + u_{0ij}) + \epsilon_{tij} \\ & = (\gamma_{000}) + (r_{00j} + u_{0ij} + \epsilon_{tij}) \end{aligned}\]
The model can be split into two parts. The ‘fixed effects’ will be represented in the first parentheses of the last expression, usually denoted by \(\gamma\) while the ‘random effects’ will be represented in the second parentheses, denoted by \(r\), \(u\) and \(\epsilon\) components.
where,
\[\begin{aligned} \epsilon_{tij} & \sim N(0, \sigma^2) \\ u_{0ij} & \sim N(0, \tau^2_{00}) \\ r_{00j} & \sim N(0, \phi^2_{00}) \end{aligned}\]
Where, - \(\epsilon_{tij}\): Variance over time within schools - \(\tau^2_{00}\): Variance between schools from the same cluster - \(\phi^2_{00}\): Variance across clusters
The unconditional growth model incorporates time as a linear growth at level one (with no predictors at level two and three) to reduce unexplained variability within schools. This model allows for assessing the within-school variability that can be attributed to the systematic changes over time.
To reduce correlation between the linear components of the time effect, the time variable rescaled. For this report, time variable (completion_year) is rescaled to the zero point, that is, the first year in which the school introduced the subject. This gives the intercept the interpretation of baseline or initial status of the enrolments.
As with the unconditional means model, the unconditional growth model can be expressed as:
Where, year is the time variable.
Level two (schools within clusters) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + u_{1ij} \end{aligned}\]
Level three (clusters) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{10j} &= \gamma_{100} + r_{10j} \end{aligned}\]
As compared to the null-model, the between-school and between-clusters variability are now partitioned into variability in initial status (\(\tau^2_{00}\) and \(\phi^2_{00j}\)) and variability in rates of change (\(\tau^2_{10}\) and \(\phi^2_{10}\)); i.e.
At level one:
enrolments)At level two:
At level three:
In combining the levels, the full composite model may be obtained:
\[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + r_{10j} + u_{1ij})year_{tij} + \epsilon_{tij} \\ &= [\gamma_{000} + \gamma_{100}year_{tij}] + [r_{00j} + u_{0ij} + \epsilon_{tij} + (r_{10j} + u_{1ij})year_{tij}] \end{aligned}\]
With, \(\epsilon_{ijk} \sim N(0, \sigma^2)\),
\[\left( \begin{matrix} u_{0j} \\ u_{1j} \end{matrix} \right) \sim N \left( \begin{matrix} 0 \\ 0 \end{matrix} \begin{matrix} , \end{matrix} \begin{matrix} \tau^2_{00} & \tau_{01} \\ \tau_{01} & \tau^{2}_{10} \end{matrix} \right)\]
\[\left( \begin{matrix} r_{00j} \\ r_{10j} \end{matrix} \right) \sim N \left( \begin{matrix} 0 \\ 0 \end{matrix} \begin{matrix} , \end{matrix} \begin{matrix} \phi^2_{00} & \\ \phi_{01} & \phi^{2}_{10} \end{matrix} \right)\]
The multilevel models may encounter boundary constraint, especially when random slopes are introduced at level two and three. This singular fit error occurs as correlation coefficients between two error terms are at the ‘boundary’ of the allowable values \([1,1]\). In this report, the error is an indication of overfitting of the model (Ben Bolker et al. (2021)) – That is, the random effects structure proves to be too complex to be supported by the data.
Naturally, this suggests that the model requires re-parameterisation by altering to feature different parameters. This report considers simplifying the model by removing level three error terms (\(r_{10j}\)); which effectively removes two parameters (1) Variance for \(r_{10j}\) and (2) Correlation between \(r_{01j}\) and \(r_{10j}\)
With respect to the unconditional growth model, the model is simplified by setting the error term \(r_{10}j = 0\) (and therefore \(r_{10} = 0\)) at level three:
\[\begin{aligned} \beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \end{aligned}\]
This implies that the model’s error assumption at level three will turn into a univariate (as opposed to a multivariate) normal distribution. Furthermore, this will also imply that level-three (e.g.) clusters will be associated with the same growth rate; In other words, each cluster will have the same rate of change in enrolments for each year. In most cases, removing these parameters will have an insignificant impact on the fixed effects. Therefore, the benefit of this approach is that it will lead to a more parsimonious model that is not over-fitted and free from any boundary constraints.
sector and unit as fixed effectsIn general, there are two types of predictors which may be added to the longitudinal model (1) Time varying (level 1): predictors that are associated with a specific measurement of time (2) Time invariant (level 2 or higher): Predictors associated with schools or clusters measured only at one point in time
At this point, the model only has a dedicated time variable that makes the multilevel longitudinal, with no other predictors added. With reference to Table 1, this report will consider school-specific and subject-specific variables as there are no cluster-specific (district or postcode information) variables. To be specific, two time-invariant predictors (sector and unit) will be added to the model as fixed effects. To be noted, these categorical variables will not be added as random effects as there is not enough random-effect levels (e.g. blocks) for these predictors. As a rule of thumb, there should be more than 5 or 6 levels at a minimum (Ben Bolker et al. (2021)). Furthermore, adding another level to the model may overcomplicate interpretations.
In adding sector and unit into the model, there will be a handful of potential models (e.g. three-way interactions and two-way interactions, removing three-way interactions etc.). This report considers a forward stepwise approach, where the largest possible model (all possible combinations of fixed effects) will be fitted, before removing the fixed effects sequentially (one after another). The AIC will then be recorded for each model; Noting that models will be fitted with with maximum likelihood estimates rather than the default restricted maximum likelihood estimates (REML) as REML will generally lead to erroneous results as the fixed effects change (Faraway (2016)). The model with the lowest AIC will be deemed the “best” model, based on the fixed effects.
The parametric bootstrap will be utilised to examine if simplified (smaller) version of model (with fewer random effects) is preferred. For a three-level multilevel model, this involves removing the random slope (\(r_{10j}\)) in the larger (proposed) model at level-three to create the smaller model. Removing \(r_{10j}\) is equivalent to setting \(\phi_{01} = 0\) and \(\phi_{10} = 0\). In other words, the parametric bootstrap tests if the variance of a random effect \(r_{10j}\), is in fact 0.
The parametric bootstrap better approximates the correct \(p\)-value for the corresponding likelihood ratio test by simulating under the null hypothesis, especially when the actual likelihood ratio test statistic is at the vicinity of the critical value of the (wrong) reference distribution. The standard \(\chi^2\) asymptotic are not used as it is proven to be produce a ‘too conservative’ test, where the \(p\)-values are usually too large and not rejected enough (Faraway (2016)).
In conducting the parametric bootstrap involving variance terms at the boundary, the null and alternative hypothesis can be written as: \[\begin{aligned} H_0: \phi^2_{00} = \phi_{01} \ne 0 \\ H_A: \phi^2_{00} = \phi_{01} = 0 \end{aligned}\]
Begin by fitting both (full and smaller) models to the actual data with maximum likelihood estimates to obtain the actual likelihood ratio test statistic (\(2 \times logL(\text{full model}) - logL(\text{reduced model})\)). This will be called the “actual (observed) likelihood ratio test statistic.”
Using the smaller model, obtain estimated fixed effects & variance components (parametric).
Use estimated fixed effects and variance components to generate pseudo-data from the null model – i.e. regenerate a new set of response (log(enrolments)) with its associated fixed effects for each observation (bootstrap).
Fit both smaller and reduced model to the new pseudo-data.
Compute likelihood ratio test statistic comparing both models.
Repeat steps 2-4 for a large number of times (this report simulates \(B = 1,000\) bootstrap samples).
The estimated \(p\)-value is computed by finding the proportion of likelihood ratio test statistic that exceeds the actual (observed) value in step (1). That is, the chance of observing the large (small) actual likelihood test statistic or check if the large (small) actual test statistic was a rare occurrence.
Note:: Code is provided by fabians ((https://stats.stackexchange.com/users/1979/fabians) (n.d.)); Some modifications were made to the original code to be aligned with the analysis.
In conducting the parametric bootstrap, a histogram of the likelihood ratio test statistic will then be implemented to visualise the results (can be compared with a \(\chi^2\)-distribution with 2 degrees of freedom). If the null hypothesis is rejected at the 5% level (i.e. p-value < 0.05), it indicates that the smaller model is not correct and the full model will be selected.
With our final model, confidence intervals can be constructed for the parameters; This can be done via the confint(method = "boot") function. However, it is always better to get a more intuitive understanding of what the function is doing under the hood.
Simulate data from the proposed model
Refit the model using the new responses from the simulated data and estimate the parameters
Repeat steps 1 and 2 for a larger number of times, storing the results each time
Use quantiles of bootstrapped estimates to compute the estimated intervals
Like any other confidence interval, the output can be interpreted as – if the procedure was repeated many times over, the true value will lie within the interval 95% of the time.
This confidence interval can be used to reinforce the idea that the larger model is better by checking if the variance parameters does not include 0, suggesting that the random effects are significant at the 5% level.
Fixed effects are functions of the covariates that accounts for the differences over time, ‘controlling out’ higher-level differences and processing the distinctive specific characteristics of higher-level units (Bell and Jones (2015)). Fixed effects introduces a separate parameter for each group – equivalent to introducing dummy variables for each higher-level entity (less a reference category) – where each group are independent and has its own individual characteristics. As will be demonstrated, it allows for analysing the trends in enrolments between the different sectors (Catholic, Independent and Government) and units (year 11 and year 12) and the coefficients may be interpreted like a standard regression model.
With the higher-level variance being ‘controlled out’ by the fixed effects, the random effects framework allows for heterogeneity to not only be corrected, but explicitly modelled (Bell and Jones (2015)). The random effects expresses the variation between all schools, which assumes that the higher-level entities comes from random draws of a normal distribution – which the variance is estimated by the model – and can itself be interpretable and relevant. This implies that random effects induces a correlation between schools at the same level (e.g. schools in the same district).
Figure 11: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Mathematical Methods
Figure 11 fits a linear model for 20 randomly selected schools. As enrolments are plotted on the same y-axis, the different sizes of school is apparent. For instance, school 174 has more than 100 enrolments for mathematical methods for each cohort, while school 496 and 518 have little enrolments (< 50) for each cohort. Each school showed rather distinct trend, where school 441 and 304 showed a decrease in enrolments while schools 28, 174, 417 showed an increase in enrolments, on average. It can also be observed that school 391 (St Mary’s College) only introduced the subject in 2020, as it was a relatively new school.
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.
Figure 12: Effects of log transformation for response variable (enrolments) in Mathematical Methods
As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 12).
| df | AIC | |
|---|---|---|
| Model0.2: Schools nested within districts | 4 | 24303.39 |
| Model0.1: Schools nested within postcodes | 4 | 24333.96 |
| Model0.0: Within schools | 3 | 24346.27 |
As per step 3, the three potential models are fitted, with the AIC shown in Table 6. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.
summary(model0.2)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id:qcaa_district (Intercept) 0.83408 0.91328
## qcaa_district (Intercept) 0.14864 0.38554
## Residual 0.17998 0.42424
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.305797 0.1152798 28.67629
##
## Number of schools (level-two group) = 466
## Number of district (level-three group) = 13
This model will takes into account 466 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:
The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):
\[ \text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.83408}{(0.83408 + 0.14864 + 0.17998)} = 0.7174\]
This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 71.74% of the total variability is attributable to the changes overtime within schools.
The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific school \(j\).
\[ \text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.14864}{(0.83408 + 0.14864 + 0.17998)} = 0.1278\]
Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 12.78% of the total variability is due to the difference between districts.
The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 2.0601e+00 1.4353101
## year92 1.8682e-03 0.0432225 -0.753
## qcaa_district (Intercept) 6.9225e-02 0.2631056
## year92 5.9806e-05 0.0077335 1.000
## Residual 1.2387e-01 0.3519528
## Estimate Std. Error t value
## (Intercept) 3.066187193 0.100976610 30.365321
## year92 0.009110851 0.003051126 2.986062
## Number of Level Two groups = 466
## Number of Level Three groups = 13
When the subject was first introduced in 1992, schools were expected to have 15.5009 (\(e^{1.7848}\)) enrolments, on average. Enrolments were expected to increase by 4.7598% (\((e^{0.0465} - 1)\times100\)) per year. The estimated within-schools variance decreased by -9.5556% (0.1800 to 0.1628), implying that -9.5556% of within-school variability can be explained by the linear growth over time.
A singular fit is observed in the model as the correlation between the intercept and slope between districts are perfectly correlation (i.e. \(\phi_{01}\) = 1). This may suggest that the model is overfitted – i.e. the random effects structure is too complex to be supported by the data and may require some re-parameterisation. Naturally, the higher-order random effects (e.g. random slope of the third level (between district)) can be removed, especially where the variance and correlation terms are estimated on the boundaries (Roback and Legler (2021)).
summary(model1.1)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 2.0930666 1.446743
## year92 0.0019446 0.044097 -0.757
## qcaa_district (Intercept) 0.2039055 0.451559
## Residual 0.1238374 0.351905
## Estimate Std. Error t value
## (Intercept) 3.053472289 0.14377570 21.237749
## year92 0.009675574 0.00220878 4.380505
## Number of Level Two groups = 466
## Number of Level Three groups = 13
To elaborate, two parameters were removed by setting variance components \(\phi^2_{10}\) = \(\phi_{01}\) equal to zero Which indirectly assumes that the growth rate for district \(j\) to be fixed. As shown in the output above, this produced a more stable model and is free from any boundary constraints. As compared to the unconditional growth model (model1.0), the fixed effects remained rather similar.
Level one and level two will be identical to the unconditional growth model (model1.0), however, the random slope for level 3 will be removed. This implies that the error assumption at level three now follows a univariate normal distribution where \(r_{00j} \sim N(0, \phi_{00}^{2})\).
The new level three (districts): \[\beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \]
And therefore composite model: \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{100}year92_{tij} \right] + \left[r_{00j} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]
| model | AIC |
|---|---|
| model4.7 | 18176.17 |
| model4.1 | 18177.29 |
| model4.4 | 18177.87 |
| model4.0 | 18178.63 |
| model4.5 | 18178.71 |
| model4.10 | 18238.84 |
| model4.9 | 18238.84 |
| model4.2 | 18238.85 |
| model4.8 | 18238.85 |
| model4.6 | 18239.98 |
| model4.3 | 18241.58 |
As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.7 (Table 7). The next section will test the selected model’s random effects to build the final model.
This step will not be undertaken, as the random slope will not be included at level three as a boundary constraint was found in the unconditional growth model, indicating that the model will be overfitted if random slopes were included at level three.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_district:qcaa_school_id | 1.2611487 | 1.4536475 |
| cor_year92.(Intercept)|qcaa_district:qcaa_school_id | -0.7727809 | -0.6789689 |
| sd_year92|qcaa_district:qcaa_school_id | 0.0370458 | 0.0430934 |
| sd_(Intercept)|qcaa_district | 0.2229889 | 0.6290453 |
| sigma | 0.3444005 | 0.3511923 |
| (Intercept) | 3.0378500 | 3.8143678 |
| year92 | -0.0002020 | 0.0197675 |
| sectorGovernment | -0.3032420 | 0.4819675 |
| sectorIndependent | -1.6557356 | -0.8014642 |
| unityear_12_enrolments | -0.1087469 | -0.0620239 |
| year92:sectorGovernment | -0.0257979 | -0.0028113 |
| year92:sectorIndependent | 0.0132636 | 0.0383081 |
| sectorGovernment:unityear_12_enrolments | -0.0550680 | 0.0005954 |
| sectorIndependent:unityear_12_enrolments | -0.0385138 | 0.0211624 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0, indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]
Level two (schools within districts) will contain new predictor(sector)
\[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]
Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]
Therefore, the composite model can be written as:
\[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{030}sector_{ij}unit_{ij} +\gamma_{100}year92_{tij} + \gamma_{110} sector_{ij}year92_{tij} \right] + \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 1.8467382 1.358947
## year92 0.0016073 0.040092 -0.729
## qcaa_district (Intercept) 0.2117140 0.460124
## Residual 0.1210798 0.347965
## Estimate Std. Error t value
## (Intercept) 3.429618908 0.202380825 16.9463629
## year92 0.009468131 0.004804577 1.9706483
## sectorGovernment 0.067490171 0.182073306 0.3706758
## sectorIndependent -1.246645798 0.198604067 -6.2770406
## unityear_12_enrolments -0.084742947 0.011705675 -7.2394753
## year92:sectorGovernment -0.013953478 0.005543237 -2.5172075
## year92:sectorIndependent 0.026088675 0.006094448 4.2807280
## sectorGovernment:unityear_12_enrolments -0.028631744 0.013514625 -2.1185748
## sectorIndependent:unityear_12_enrolments -0.009183082 0.015106467 -0.6078908
## Number of Level Two groups = 466
## Number of Level Three groups = 13
Based on the model output (see detailed explanation of fixed effects in step 9), the estimated mean enrolments for government school are expected to decrease by 0.4475% (\((e^{0.00946813 - 0.01395348} - 1) * 100\)) each year, which is 1.3857% (\((e^{- 0.01395348} - 1) * 100\)) less than that of catholic schools. On the other hand, independent schools are predicted to have an increase of 3.6197% (\((e^{0.00946813 + 0.02608867} - 1) * 100\)), which is 2.6432% more than that of catholic schools.
Figure 13: Fixed effects of the final model for Mathematical Methods
Based on the fixed effects (Figure 13), independent schools are expected to have the relatively highest increase in average enrolments across the years while government school showed a decrease in enrolments over the years. All sectors shows the year 11 enrolments are expected to be more than year 12 enrolments each year; This may indicate that students may be discontinuing the subject in year 11 after completing year 11 syllabus.
Figure 14: Random effects for all schools
Figure 14 represents the random intercept and slope of the random effects for a given school. It is manifest that the intercept and slope are negatively correlated (-0.73, as shown on the model summary output), where a large intercept is associated with a smaller random slope. This suggests that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have large increase in enrolments over the years.
Figure 15: Random intercept for districts
As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 15 demonstrates that schools in Brisbane Central has the largest enrolments, on average.
Figure 16: Model predictions for 20 randomly selected schools
Figure 16 above shows the predictions for 20 randomly selected schools.
Figure 17: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Specialist Mathematics subject
With reference to the first step, Figure 17 fits a linear model for enrolments for a random sample of 20 schools. A myriad of patterns can be obtained from this. For instance, enrolments in school 2 (top left) appears to have significant larger increase in enrolments over the years the specialist mathematics have been introduced. Most schools showed relatively small enrolment numbers of less than 50, however school 2 and 175 showed rather large enrolment numbers for each cohort. It also demonstrates that each school can introduce (or end) specialist mathematics at different years – e.g. School 591 only introduced the subject in 2016, and School 113 introduced the subject in 1995 and discontinued the subject in 2005.
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.
Figure 18: Effects of log transformation for response variable (enrolments) in Specialist Mathematics subject
As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 18).
| df | AIC | |
|---|---|---|
| Model0.2: Schools nested within districts | 4 | 25300.73 |
| Model0.1: Schools nested within postcodes | 4 | 25344.29 |
| Model0.0: Within schools | 3 | 25361.47 |
As per step 3, the three potential models are fitted, with the AIC shown in Table 9. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.
summary(model0.2)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id:qcaa_district (Intercept) 0.42131 0.64908
## qcaa_district (Intercept) 0.11159 0.33405
## Residual 0.27720 0.52649
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.120037 0.09828461 21.57039
##
## Number of schools (level-two group) = 422
## Number of district (level-three group) = 13
This model will takes into account 422 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:
The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):
\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4213}{(0.4213 + 0.1116 + 0.2772)} = 0.5201\]
This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 52.01% of the total variability is attributable to the changes overtime within schools.
The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific district \(j\) in time \(t\) and time \(t^* \ne t\).
\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1116}{(0.4213 + 0.1116 + 0.2772)} = 0.1378\]
Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 11.70% of the total variability is due to the difference between districts.
The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 0.8214386 0.906332
## year92 0.0011262 0.033559 -0.654
## qcaa_district (Intercept) 0.0000000 0.000000
## year92 0.0002295 0.015149 NaN
## Residual 0.2092841 0.457476
## Estimate Std. Error t value
## (Intercept) 1.79689374 0.04801983 37.419825
## year92 0.01647784 0.00461389 3.571355
## Number of Level Two groups = 422
## Number of Level Three groups = 13
When the subject was first introduced in 1992, schools were expected to have 6.0309 (\(e^{1.7848}\)) enrolments, on average. This enrolments were rather low as there were only a small fraction of schools that offered the subject in 1992 (as demonstrated in Figure 2). On average, enrolments were expected to increase by 1.6614% (\((e^{0.0164778} - 1)\times100\)) per year. The estimated within-schools variance decreased by 24.45% (0.2772 to 0.2093), implying that 24.5% of within-school variability can be explained by the linear growth over time.
A singular fit is observed in the model as the correlation between the intercept and slope between districts are perfectly correlation (i.e. \(\phi_{01}\) = 1). This may suggest that the model is overfitted – i.e. the random effects structure is too complex to be supported by the data and may require some re-parameterisation. Naturally, the higher-order random effects (e.g. random slope of the third level (between district)) can be removed, especially where the variance and correlation terms are estimated on the boundaries (add bookdown reference).
summary(model1.1)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 0.8433365 0.918334
## year92 0.0012537 0.035407 -0.679
## qcaa_district (Intercept) 0.1201703 0.346656
## Residual 0.2093102 0.457504
## Estimate Std. Error t value
## (Intercept) 1.75561841 0.107914664 16.268581
## year92 0.01832492 0.001964398 9.328521
## Number of Level Two groups = 422
## Number of Level Three groups = 13
To elaborate, two parameters were removed by setting variance components \(\phi^2_{10}\) = \(\phi_{01}\) equal to zero Which indirectly assumes that the growth rate for district \(j\) to be fixed. As shown in the output above, this produced a more stable model and is free from any boundary constraints. As compared to the unconditional growth model (model1.0), the fixed effects remained rather similar.
Level one and level two will be identical to the unconditional growth model (model1.0), however, the random slope for level 3 will be removed. This implies that the error assumption at level three now follows a univariate normal distribution where \(r_{00j} \sim N(0, \phi_{00}^{2})\).
The new Level three (districts): \[\beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \]
And therefore composite model: \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{100}year92_{tij} \right] + \left[r_{00j} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]
| model | AIC |
|---|---|
| model4.4 | 22025.38 |
| model4.5 | 22027.29 |
| model4.7 | 22028.11 |
| model4.1 | 22030.06 |
| model4.0 | 22033.52 |
| model4.3 | 22044.92 |
| model4.2 | 22045.67 |
| model4.8 | 22045.67 |
| model4.10 | 22045.67 |
| model4.9 | 22045.67 |
| model4.6 | 22047.61 |
As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.4 (Table 10). The next section will test the selected model’s random effects to build the final model.
This step will not be undertaken, as the random slope will not be included at level three as a boundary constraint was found in the unconditional growth model, indicating that the model will be overfitted if random slopes were included at level three.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 10 | 24312.80 | 24389.11 | -12146.40 | 24292.80 | NA | NA | NA |
| 12 | 22025.38 | 22116.96 | -11000.69 | 22001.38 | 2291.42 | 2 | 0 |
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_district:qcaa_school_id | 0.8475964 | 0.9886445 |
| cor_year92.(Intercept)|qcaa_district:qcaa_school_id | -0.7244002 | -0.5994735 |
| sd_year92|qcaa_district:qcaa_school_id | 0.0314348 | 0.0370023 |
| sd_(Intercept)|qcaa_district | 0.1917148 | 0.5116679 |
| sigma | 0.4517649 | 0.4620813 |
| (Intercept) | 1.5792349 | 2.1738440 |
| sectorGovernment | -0.2351881 | 0.2926828 |
| sectorIndependent | -0.7884265 | -0.1855885 |
| unityear_12_enrolments | -0.0559129 | -0.0286231 |
| year92 | 0.0058831 | 0.0231169 |
| sectorGovernment:year92 | -0.0122945 | 0.0079922 |
| sectorIndependent:year92 | 0.0074664 | 0.0310211 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the fixed and random effects all exclude 0 (Table 12), indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]
Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]
Level three (districts) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]
Therefore, the composite model can be written as \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + u_{0ij} \right] + \left[\gamma_{100} + \gamma_{110}sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{010}sector_ {ij} + \gamma_{020}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}sector_{ij}year92_{tij} \right] + \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + u_{0ij} + u_{1ij}year92_{tij} + + \epsilon_{tij} \right]\end{aligned}\]
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 0.8370257 0.914891
## year92 0.0011733 0.034254 -0.666
## qcaa_district (Intercept) 0.1262182 0.355272
## Residual 0.2087225 0.456862
## Estimate Std. Error t value
## (Intercept) 1.88540434 0.150880944 12.4959740
## sectorGovernment 0.02044649 0.131787850 0.1551470
## sectorIndependent -0.48119714 0.148800988 -3.2338303
## unityear_12_enrolments -0.04230158 0.007419052 -5.7017495
## year92 0.01448120 0.004293656 3.3726968
## sectorGovernment:year92 -0.00194999 0.005033040 -0.3874377
## sectorIndependent:year92 0.01893009 0.005655238 3.3473547
## Number of Level Two groups = 422
## Number of Level Three groups = 13
Based on the model output (see detailed explanation of fixed effects in step 9), the estimated mean enrolments for government schools are estimated to be 2.06% (\((e^{0.02044} - 1) \times 100\)) more than that of catholic schools when the subject is first introduced in 1992 (i.e. larger initial status). Government schools are estimated to have a mean increase in enrolments of 1.2610% (\((e^{(0.01448 - 0.0019)} - 1) \times 100\)) per year, which is -0.1948% (\((e^(-0.00194999) - 1) * 100)\)) less than that of catholic schools.
On the other hand, independent schools have an estimated mean enrolments of 38% less than that of catholic schools when the subject is first introduced in 1992. However, this low initially low enrolments is matched with an a mean increase of 3.3976% per year, which is 1.9110% more than that of catholic schools per year.
Figure 19: Fixed effects of the final model for Specialist Mathematics subject
Based on the fixed effects, independent schools are expected to have highest log enrolments after 2020 (Figure 19. In all sectors, unit 11 enrolments appears to be marginally larger than unit 12, which may imply that students are taking the subject in year 10, so lesser students are enrolled in the subject in year 11. Government sectors appears to have the highest enrolment numbers initially (in year 1992), but have increases at a relatively slower rate as compared to the other sectors.
To be noted, these fixed effects considers all schools within the different sectors; Therefore, the many small government schools with little enrolments in the subject (as seen in Figure 4) may ‘down weight’ the mean fixed effects of all schools.
Figure 20: Random effects for all schools
Figure 20 represents the random intercept and slope of the random effects for a given school. It is a manifest that the intercept and slope are negatively correlated, where a large intercept is associated with a smaller random slope. This suggests that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have large increase in enrolments over the years.
Figure 21: Random intercept for districts
As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 21 demonstrates that schools in Brisbane Central has the largest enrolments, on average.
Figure 22: Model predictions for 20 randomly selected schools
Figure 22 above shows the predictions for 20 randomly selected schools.
Figure 23: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for biology subject
As stated above, Essential Mathematics is one of the new subject introduced in the QCE system. Thus, there are only observations for three years, spanning from 2020 to 2022, as shown in Figure 23. The various sizes of schools can be seen in the plot, where schools 589, 631, 753 (bottom-right) have relatively low enrolments as compared to school 40 or school 88 (top-right). School 484 introduced the school in 2020, but discontinued offering the subject since.
The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 23). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.
| df | AIC | |
|---|---|---|
| Model0.2: Schools nested within districts | 4 | 3614.649 |
| Model0.0: Within schools | 3 | 3619.822 |
| Model0.1: Schools nested within postcodes | 4 | 3621.822 |
Referring back to step 3, three candidate models are fitted, with the AIC shown in Table 13. Model0.2, corresponding to having schools nested within districts is the best model, with optimised (lowest) AIC and will be used in the subsequent analysis.
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id:qcaa_district (Intercept) 1.148377 1.07162
## qcaa_district (Intercept) 0.047705 0.21841
## Residual 0.113995 0.33763
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.841442 0.07919555 48.50577
##
## Number of schools (level-two group) = 458
## Number of district (level-three group) = 13
This model takes into account 458 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:
The level-two ICC relates to the correlation between school \(i\) from a certain district \(k\) in time \(t\) and in time \(t^*\):
\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{1.1484}{(1.1484 + 0.0477 + 0.1140)} = 0.8766\]
This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 87.66% of the total variability is attributable to the differences between schools from the same district rather than changes over time within schools.
The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific school \(j\).
\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.0477}{(1.1484 + 0.0477 + 0.1140)} = 0.0364\]
Similarly, it can be inferred that the correlation between enrolments of two randomly selected schools from different districts are 3.64%, where the total variability can be attributed to the difference between districts.
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 1.14444636 1.069788
## year20 0.04903558 0.221440 -0.088
## qcaa_district (Intercept) 0.04361994 0.208854
## year20 0.00082473 0.028718 0.230
## Residual 0.07323625 0.270622
## Estimate Std. Error t value
## (Intercept) 3.78369587 0.07738087 48.897045
## year20 0.05820918 0.01485465 3.918583
## Number of Level Two groups = 458
## Number of Level Three groups = 13
The unconditional growth model adds the systematic changes over time, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to the linear changes over time. Based on the model output:
Essential Mathematics was first introduced in 2020, and schools are expected to have 43.9785 (\(e^{3.7837}\)), on average. Furthermore, enrolments were expected to increase by 5.9937% (\(e^{0.0582092} - 1) \times 100\)) every year. The estimated within-school variance decrease by 35.7577% (0.1140 to 0.0732362), implying that 35.7577% of the within-school variability can be explained by the linear growth over time.
| model | npar | AIC | BIC | logLik |
|---|---|---|---|---|
| model4.0 | 19 | 2946.979 | 3058.468 | -1454.489 |
| model4.6 | 15 | 2951.546 | 3039.564 | -1460.773 |
| model4.2 | 14 | 2951.814 | 3033.964 | -1461.907 |
| model4.8 | 14 | 2951.814 | 3033.964 | -1461.907 |
| model4.9 | 14 | 2951.814 | 3033.964 | -1461.907 |
| model4.10 | 14 | 2951.814 | 3033.964 | -1461.907 |
| model4.1 | 17 | 2955.390 | 3055.143 | -1460.695 |
| model4.7 | 16 | 2955.666 | 3049.552 | -1461.833 |
| model4.3 | 13 | 2981.319 | 3057.601 | -1477.659 |
| model4.4 | 14 | 2985.158 | 3067.308 | -1478.579 |
| model4.5 | 15 | 2985.186 | 3073.204 | -1477.593 |
As detailed in step 6, level-two predictors (sector and unit) are added to the model. The largest possible model will be fitted, before removing each fixed effect one by one whilst recording the AIC for each model. model4.0 corresponds to the largest model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is the largest possible model model4.0, and will be used in subsequent sections.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 17 | 2944.454 | 3044.208 | -1455.227 | 2910.454 | NA | NA | NA |
| 19 | 2946.979 | 3058.468 | -1454.489 | 2908.979 | 1.47565 | 2 | 0.256 |
Figure 24: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 24 displays the likelihood ratio test statistic from the null distribution, with the red line representing the likelihood ratio test statistic using the actual data.
The p-value of 25.6% (Table 15) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The large estimated \(p\)-value is 0.256 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the smaller model (without random slope at level three) is preferred.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_district:qcaa_school_id | 0.7489372 | 0.8615391 |
| cor_year20.(Intercept)|qcaa_district:qcaa_school_id | -0.2253224 | -0.0015329 |
| sd_year20|qcaa_district:qcaa_school_id | 0.2044296 | 0.2476368 |
| sd_(Intercept)|qcaa_district | 0.1525776 | 0.4495489 |
| sigma | 0.2577367 | 0.2748534 |
| (Intercept) | 3.3929124 | 3.8944034 |
| year20 | 0.0209839 | 0.1432844 |
| sectorGovernment | 0.5992458 | 1.0092179 |
| sectorIndependent | -1.1713273 | -0.7385895 |
| unityear_12_enrolments | 0.0544900 | 0.1923526 |
| year20:sectorGovernment | -0.1098924 | 0.0381604 |
| year20:sectorIndependent | -0.0548805 | 0.1066265 |
| year20:unityear_12_enrolments | -0.1250023 | -0.0091553 |
| sectorGovernment:unityear_12_enrolments | -0.2852633 | -0.1176833 |
| sectorIndependent:unityear_12_enrolments | -0.0772452 | 0.1178977 |
| year20:sectorGovernment:unityear_12_enrolments | 0.0179056 | 0.1611691 |
| year20:sectorIndependent:unityear_12_enrolments | -0.0740500 | 0.0683533 |
The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The 95% confidence interval is shown above (Table 16), and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level.
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year20_{tij} + \epsilon_{tij}\]
Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{1ij} \end{aligned}\]
Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \\ \beta_{12j} &= \gamma_{120} \\ \beta_{13j} &= \gamma_{130} \end{aligned}\]
Therefore, the composite model can be written as
\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + \\ & (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{13j}sector_{ij}unit_{ij} + u_{1ij})year20_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + \gamma_{120}unit_{ij} + \gamma_{130}sector_{ij}unit_{ij} + u_{1ij} \right]year20_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{030}sector_{ij}unit_{ij} + \gamma_{100}year20_{tij} + \gamma_{110}sector_{ij}year_{tij} + \gamma_{120}unit_{ij}year_{tij} + \gamma_{130}sector_{ij}unit_{ij}year_{tij} \right] \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + u_{1ij}year20_{tij} + \epsilon_{tij} \right] \end{aligned}\]
Figure 25: Fixed effects of the final model for Essential Mathematics
With a three-way interaction, it is easier to visualise the fixed effects (Figure 25). All sectors demonstrated the same trend with units, where unit 11 showed a larger increase in enrolments, on average. This implies that over the years, more students are enrolling in unit 11 than unit 12.
Figure 26: Random effects for schools
As shown in the model output and in Figure 26, there was little correlation (-0.12) between the random intercept and slope. Schools with appear to decrease or increase in enrolments at different rates; This may be attributed to the fact that the data only consists of observations for three years and there is not enough data to evaluate the enrolment patterns in schools.
Figure 27: Random intercept for districts
As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 27 demonstrates that schools in Gold Coast has the largest enrolments while Mackay have the lowest enrolments in Essential Mathematics subject, on average.
Figure 28: Model predictions for 20 randomly selected schools
Figure 28 above shows the predictions for 20 randomly selected schools.
Figure 29: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for General Mathematics subject
With reference to the first step, Figure 29 fits a linear model for enrolments for a random sample of 20 schools. Various school sizes can be seen, school 633 and 634 (bottom-right) appears to have approximately 10 enrolments per cohort, while school 25 and 34 have enrolments above 200 per cohort. Some schools showed relatively large increase in enrolments over the years, while some showed a decrease (e.g. school 41).
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.
Figure 30: Effects of log transformation for response variable (enrolments) in General Mathematics
The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 30). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.
| df | AIC | |
|---|---|---|
| Model0.2: Schools nested within districts | 4 | 24964.05 |
| Model0.0: Within schools | 3 | 24979.49 |
| Model0.1: Schools nested within postcodes | 4 | 24981.49 |
As per Step 3, the three potential models are fitted, with the AIC shown in Table 17. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id:qcaa_district (Intercept) 0.891424 0.94415
## qcaa_district (Intercept) 0.069765 0.26413
## Residual 0.174806 0.41810
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.68386 0.08530786 43.18313
##
## Number of schools (level-two group) = 481
## Number of district (level-three group) = 13
This model will takes into account 481 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:
The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):
\[ICC(school) = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.89142}{(0.89142 + 0.06976 + 0.17481)} = 0.7847\] This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 78.47% of the total variability is attributable to the changes overtime within schools.
The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific district \(j\) in time \(t\) and time \(t^* \ne t\).
\[ICC(school) = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1077}{(0.5888 + 0.1077 + 0.2094)} = 0.0614\] Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 6.14% of the total variability is due to the difference between districts.
The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 3.7608e+00 1.9392756
## year92 3.2360e-03 0.0568856 -0.869
## qcaa_district (Intercept) 1.7958e-02 0.1340077
## year92 5.8655e-05 0.0076587 1.000
## Residual 9.8464e-02 0.3137898
## Estimate Std. Error t value
## (Intercept) 2.91024486 0.098867525 29.43580
## year92 0.03476855 0.003475268 10.00457
## Number of Level Two groups = 481
## Number of Level Three groups = 13
When the subject was first introduced in 1992, schools were expected to have 18.3613 (\(e^{2.9102449}\)) enrolments, on average. On average, enrolments were expected to increase by 3.5380% (\((e^{0.0347686} - 1)\times100\)) per year. The estimated within-schools variance decreased by 90.16% (0.17481 to 0.0984), implying that 90.16% of within-school variability can be explained by the linear growth over time.
A singular fit is observed in the model as the correlation between the intercept and slope between districts are perfectly correlation (i.e. \(\phi_{01}\) = 1). This may suggest that the model is overfitted – i.e. the random effects structure is too complex to be supported by the data and may require some re-parameterisation. Naturally, the higher-order random effects (e.g. random slope of the third level (between district)) can be removed, especially where the variance and correlation terms are estimated on the boundaries (add bookdown reference).
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 3.8040956 1.950409
## year92 0.0033141 0.057568 -0.870
## qcaa_district (Intercept) 0.1242578 0.352502
## Residual 0.0984432 0.313757
## Estimate Std. Error t value
## (Intercept) 2.8959781 0.134506478 21.53040
## year92 0.0353323 0.002779433 12.71205
## Number of Level Two groups = 481
## Number of Level Three groups = 13
To elaborate, two parameters were removed by setting variance components \(\phi^2_{10}\) = \(\phi_{01}\) equal to zero Which indirectly assumes that the growth rate for district \(j\) to be fixed. As shown in the model output above, this produced a more stable model and is free from any boundary constraints. As compared to the unconditional growth model (model1.0), the fixed effects remained rather similar.
Level one and level two will be identical to the unconditional growth model (model1.0), however, the random slope for level 3 will be removed. This implies that the error assumption at level three now follows a univariate normal distribution where \(r_{00j} \sim N(0, \phi_{00}^{2})\).
The new Level three (districts): \[\beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{10j} = \gamma_{100} \]
And therefore composite model: \[\begin{aligned} Y_{tij} &= \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ &= (\beta_{00j} + u_{0ij}) + (beta_{10j} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= (\gamma_{000} + r_{00j} + u_{0ij}) + (\gamma_{100} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ &= \left[\gamma_{000} + \gamma_{100}year92_{tij} \right] + \left[r_{00j} + u_{0ij} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]
| model | AIC |
|---|---|
| model4.0 | 15044.37 |
| model4.1 | 15051.92 |
| model4.7 | 15083.63 |
| model4.5 | 15133.73 |
| model4.6 | 15143.43 |
| model4.4 | 15173.32 |
| model4.9 | 15175.07 |
| model4.2 | 15175.07 |
| model4.8 | 15175.07 |
| model4.10 | 15175.07 |
| model4.3 | 15226.17 |
As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.4 (Table 18). The next section will test the selected model’s random effects to build the final model.
This step will not be undertaken, as the random slope will not be included at level three as a boundary constraint was found in the unconditional growth model, indicating that the model will be overfitted if random slopes were included at level three.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_district:qcaa_school_id | 1.5242174 | 1.7454448 |
| cor_year92.(Intercept)|qcaa_district:qcaa_school_id | -0.8618092 | -0.8004080 |
| sd_year92|qcaa_district:qcaa_school_id | 0.0460150 | 0.0533202 |
| sd_(Intercept)|qcaa_district | 0.1979323 | 0.5344465 |
| sigma | 0.3098662 | 0.3158212 |
| (Intercept) | 2.6985826 | 3.5356257 |
| year92 | 0.0288684 | 0.0511286 |
| sectorGovernment | 0.2532983 | 1.0794554 |
| sectorIndependent | -2.1037594 | -1.1831583 |
| unityear_12_enrolments | -0.0161797 | 0.0802241 |
| year92:sectorGovernment | -0.0434640 | -0.0178734 |
| year92:sectorIndependent | 0.0139084 | 0.0424983 |
| year92:unityear_12_enrolments | -0.0018883 | 0.0030005 |
| sectorGovernment:unityear_12_enrolments | -0.2128654 | -0.0980218 |
| sectorIndependent:unityear_12_enrolments | -0.0886348 | 0.0386853 |
| year92:sectorGovernment:unityear_12_enrolments | 0.0011711 | 0.0070114 |
| year92:sectorIndependent:unityear_12_enrolments | -0.0021703 | 0.0038465 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 19), indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]
Level two (schools within districts) will contain new predictor(sector)
\[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{13j}sector_{ij}unit_{ij} + u_{1ij} \end{aligned}\]
Level three (districts) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \\ \beta_{12j} &= \gamma_{120} \\ \beta_{13j} &= \gamma_{130}\end{aligned}\]
The composite model can therefore be written as: \[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + \beta_{13j}sector_{ij}unit_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + \gamma_{120}unit_{ij} + \gamma_{130}sector_{ij}unit_{ij} + u_{1ij}\right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{030}sector_{ij}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}year92_{tij}sector_{ij} + \gamma_{120}unit_{ij}year92_{tij} + \gamma_{130}sector_{ij}unit_{ij}year92_{tij} \right] + \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij]} + u_{1ij}year92_{tij} + \epsilon_{tij} \right] \end{aligned}\]
Figure 31: Fixed effects of the final model for General Mathematics subject
With a three-way interaction, it is easier to visualise the fixed effects (Figure 31). When the subject was first introduced in 1992, independent schools appears to have the least enrolments, on average. This low enrolment number was matched with a relatively large increase in enrolments per year, as seen by the slope. Although government schools have high enrolments initially, the rate of change in enrolments over the years increased relatively slow compared to the other sectors. Year 11 and year 12 units have similar enrolment numbers, on average.
Figure 32: Random effects for all schools
A large negative correlation between the random intercept and slope at the school level is apparent (Figure 32. This suggests that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have large increase in enrolments over the years.
Figure 33: Random intercept for districts
As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 33 demonstrates that schools in Gold Coast has the largest enrolments, on average.
Figure 34: Model predictions for year 11 enrolments for 20 randomly selected schools
Figure 34 above shows the predictions for 20 randomly selected schools.
Figure 35: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Agricultural Practices
As described in step 1, a basic linear model was plotted for each school to provide insights of the enrolment trends for each school. Figure 35 demonstrates a distinct feature, in which there was a halt in the subject from 2000 to 2014. Schools 272 and 278 and 304 appers to re-introduce subject again after the halt, while some schools introduced the subject in 2015. Some enrolment patterns are rather erratic, such as having only one cohort taking the subject (school 613 and 636) or having a start increase in enrolment in one particular year (school 698).
Figure 36: Time plot of enrolments, demonstrating a halt in the subject from 2000 to 2014
A halt in Agricultural practices subject in 2000, before being re-introduced again in 2015 (Figure 36. Enrolments before year 2000 were rather small, as compared to enrolments from 2015 onwards. Therefore, enrolments before year 2000 will be removed.
As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. For these reasons, all completion years with zero enrolments will also be removed for modelling.
Figure 37: Effects of log transformation for response variable (enrolments) in Agricultural Practices subject
As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 37).
| df | AIC | |
|---|---|---|
| Model0.0: Within schools | 3 | 1573.659 |
| Model0.2: Schools nested within districts | 4 | 1574.922 |
| Model0.1: Schools nested within postcodes | 4 | 1575.659 |
As underlined in step 3, the three candidate models are fitted and their AIC is shown in Table 20. Based on the AIC, the two-level model (model0.0) is the best model and will be used in the subsequent analysis.
summary(model0.0)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id (Intercept) 0.48659 0.69756
## Residual 0.39251 0.62651
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.647161 0.07777087 34.03796
##
## Number of schools (level-two group) = 93
## Number of district (level-three group) = NA
The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):
\[ \text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4866}{(0.4866 + 0.3925)} = 0.5535\]
This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 55.35% of the total variability is attributable to the differences in enrolments within schools at different time periods.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 0.550453 0.741925
## year15 0.006227 0.078912 -0.402
## Residual 0.257708 0.507650
## Estimate Std. Error t value
## (Intercept) 2.0127475 0.09392102 21.43021
## year15 0.1514085 0.01398980 10.82278
## Number of Level Two groups = 93
## Number of Level Three groups = NA
The next step involves incorporating the linear growth of time into the model. The model output is shown above.
When the subject was re-introduced into the QCE system in 2015, schools were expected to have 7.4842 (\(e^{2.0128}\)) enrolment. On average, enrolments were expected to increase by a staggering 16.35% (\((e^{0.151408} - 1)\times100\)) per year. The estimated within-schools variance decreased by 34.34% (0.3925 to 0.2577), implying that 34.34% of within-school variability can be explained by the linear growth over time.
| model | npar | AIC | BIC | logLik |
|---|---|---|---|---|
| model4.5 | 12 | 1327.094 | 1381.961 | -651.5468 |
| model4.4 | 11 | 1328.231 | 1378.526 | -653.1157 |
| model4.3 | 10 | 1329.521 | 1375.244 | -654.7606 |
| model4.1 | 14 | 1330.494 | 1394.506 | -651.2470 |
| model4.7 | 13 | 1331.572 | 1391.012 | -652.7862 |
| model4.6 | 12 | 1333.059 | 1387.926 | -654.5293 |
| model4.0 | 16 | 1333.719 | 1406.875 | -650.8593 |
| model4.9 | 11 | 1334.117 | 1384.412 | -656.0585 |
| model4.2 | 11 | 1334.117 | 1384.412 | -656.0585 |
| model4.8 | 11 | 1334.117 | 1384.412 | -656.0585 |
| model4.10 | 11 | 1334.117 | 1384.412 | -656.0585 |
As outlined in step 6, sector and unit will be added as predictors to the model. The largest possible model (model4.0) will then be fitted, before removing the fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), while recording the AIC for each model. model4.5 appears to have the optimal (smallest) AIC (Table 21), and will be used in the next section to build the final model.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 10 | 1330.958 | 1376.681 | -655.4792 | 1310.958 | NA | NA | NA |
| 12 | 1327.094 | 1381.961 | -651.5468 | 1303.094 | 7.864871 | 2 | 0.009 |
Figure 38: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 38 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.
The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The low estimated \(p\)-value is 0.009 < 0.05 (Table 22) rejects the null hypothesis at the 5% level, indicating that the larger model (including random slope at level two) is preferred.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_school_id | 0.5410851 | 0.8275173 |
| cor_year15.(Intercept)|qcaa_school_id | -0.6286851 | 0.1177694 |
| sd_year15|qcaa_school_id | 0.0360913 | 0.0971561 |
| sigma | 0.4719375 | 0.5290075 |
| (Intercept) | 1.5118458 | 2.6445580 |
| year15 | -0.0251663 | 0.1676654 |
| sectorGovernment | -0.3715192 | 0.7707218 |
| sectorIndependent | -1.3320193 | 0.2141525 |
| unityear_12_enrolments | -0.4283373 | -0.1388208 |
| year15:sectorGovernment | -0.0312332 | 0.1723089 |
| year15:sectorIndependent | 0.0272586 | 0.2881371 |
| year15:unityear_12_enrolments | -0.0002587 | 0.0607488 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 23), indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij}\]
Level two (schools within postcodes) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij} \end{aligned}\]
Therefore, the composite model can be written as
\[\begin{aligned}Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij})year15_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + \beta_{10j}year15_{tij} + \beta_{11j}sector_{ij}year15_{tij} + \beta_{12j}unit_{ij}year15_{tij} \right] \left[u_{0ij} + u_{1ij} + \epsilon_{tij} \right] \end{aligned}\]
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 0.4842622 0.695890
## year15 0.0048578 0.069698 -0.337
## Residual 0.2509081 0.500907
## Estimate Std. Error t value
## (Intercept) 2.03638008 0.29670877 6.863228
## year15 0.06877268 0.04919228 1.398038
## sectorGovernment 0.22111834 0.31239875 0.707808
## sectorIndependent -0.53317803 0.37319648 -1.428679
## unityear_12_enrolments -0.27868693 0.07555397 -3.688581
## year15:sectorGovernment 0.06990061 0.05093108 1.372455
## year15:sectorIndependent 0.14926780 0.06131800 2.434323
## year15:unityear_12_enrolments 0.02866851 0.01629460 1.759388
## Number of Level Two groups = 93
## Number of Level Three groups = NA
Based on the model output, the estimated mean enrolments for government schools are estimated to be 24.75% (\(e^{0.2211} - 1 \times 100\)) more than that of catholic schools when the subject was re-introduced in 2015. Government schools are also estimated to have an estimated increase in mean enrolments of 14.87% (\((e^{0.0687727 + 0.0699006} - 1) \times 100\)) per year, which is 7.24% more than that of catholic schools.
On the other hand, independent schools are estimated to have mean enrolments of 41.32% (\((e^{-0.5331780} - 1) \times 100\)) less than that of catholic schools in 2015. This low initial status is matched with a relatively large increase of 24.36% in mean enrolments per year, which is a staggering 16.10% more than that of catholic schools, on average.
Figure 39: Fixed effects of the final model for Agricultural Practices subject
Based on the fixed effects (see step 9 for detailed explanation), independent schools show the greatest rate of change (as indicated by the slope in Figure 39), indicating that enrolment numbers are increasing at the highest rate as compare to all other sectors. For all sectors, unit 12 appears to have a larger slope than that of unit 11, indicating that there are increasingly more unit 12 enrolments than unit 11. It appears that after 2025, unit 12 enrolments are generally more than that of unit 11. Catholic showed a rather slow increase in enrolments relatively to the other sectors.
Figure 40: Random effects for all schools
Figure 40 represents the random intercept and slope of the random effects for all 93 schools who offered Agricultural Practices subject. There are no distinct correlations between the random intercept and slope.
Figure 41: Model predictions for year 11 enrolments for 20 randomly selected schools
Figure 41 above shows the predictions for 20 randomly selected schools.
Figure 42: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Agricultural Science subject
Figure 42 fits a linear model for 20 randomly selected schools. In most of these selected schools, the linear model captured a downward trend in enrolments. Some schools showed a increase, and in particular, school 288 showed a significant increase in enrolments as compared to the other selected schools. It seems that most schools that ceased offering the subject in the later years (e.g. schools 365, 490, 567) showed a large downward trend before ceasing the subject. Interestingly, school 508 displayed a rather cubic trend, where enrolments increases exponentially from 2002 before decrease till 2015 and picking up again.
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.
Figure 43: Effects of log transformation for response variable (enrolments) in Agricultural Science subject
The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 43). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.
| df | AIC | |
|---|---|---|
| Model0.1: Schools nested within postcodes | 4 | 5146.230 |
| Model0.0: Within schools | 3 | 5149.113 |
| Model0.2: Schools nested within districts | 4 | 5149.856 |
As outlined in step 3, the three candidate models are fitted and their AIC is shown in Table 24. Based on the AIC, the two-level model (model0.1) corresponding to schools nested within postcodes is the superior model and will be used in the subsequent analysis.
summary(model0.1)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id:school_postcode (Intercept) 0.40515 0.63651
## school_postcode (Intercept) 0.12275 0.35036
## Residual 0.29180 0.54018
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.018581 0.07739906 26.08017
##
## Number of schools (level-two group) = 112
## Number of school postcodes (level-three group) = 81
This model will takes into account 112 schools nested in 81 postcodes. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:
The level-two ICC relates to the correlation between school \(i\) from a certain postcode \(k\) in time \(t\) and in time \(t^* \ne t\):
\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4052}{(0.4052 + 0.1228 + 0.2918)} = 0.4943\]
This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 49.45% of the total variability is attributable to the changes over time within schools.
The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific postcode \(j\).
\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1228}{(0.4052 + 0.1228 + 0.2918)} = 0.1498\]
Likewise, this can be loosely translated to be the correlation between enrolments of two random draws from two schools from two different postcode. In this case, 14.98% of the total variability is due to the difference between postcodes.
The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same postcodes, and schools from different postcodes.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## school_postcode:qcaa_school_id (Intercept) 1.0049220 1.002458
## year94 0.0013985 0.037397 -0.906
## school_postcode (Intercept) 0.1565287 0.395637
## year94 0.0002277 0.015090 -0.669
## Residual 0.2539595 0.503944
## Estimate Std. Error t value
## (Intercept) 1.83876433 0.119841404 15.343314
## year94 0.01227337 0.004831677 2.540188
## Number of Level Two groups = 112
## Number of Level Three groups = 81
When the subject was first introduced in 1994, schools were expected to have 6.2888 (\(e^{1.8388}\)) enrolments, on average. As displayed in Figure 3, the average enrolments for agricultural science is relatively low, where the enrolments during the old QCE system had approximately 10 enrolments while the new QCE system had roughly 15 enrolments per school, on average.
Enrolments were expected to increase by 1.2349% (\((e^{0.01227} - 1)\times100\)) per year. The estimated within-schools variance decreased by 12.968% (0.2918 to 0.2539595), implying that 12.968% of within-school variability can be explained by the linear growth over time.
| model | npar | AIC | BIC | logLik |
|---|---|---|---|---|
| model4.3 | 13 | 4841.696 | 4919.622 | -2407.848 |
| model4.6 | 15 | 4841.897 | 4931.812 | -2405.949 |
| model4.5 | 15 | 4843.214 | 4933.129 | -2406.607 |
| model4.1 | 17 | 4843.639 | 4945.542 | -2404.819 |
| model4.0 | 19 | 4847.303 | 4961.195 | -2404.651 |
| model4.10 | 14 | 4848.524 | 4932.444 | -2410.262 |
| model4.9 | 14 | 4848.524 | 4932.444 | -2410.262 |
| model4.2 | 14 | 4848.524 | 4932.444 | -2410.262 |
| model4.8 | 14 | 4848.524 | 4932.444 | -2410.262 |
| model4.4 | 14 | 4850.301 | 4934.221 | -2411.151 |
| model4.7 | 16 | 4850.361 | 4946.270 | -2409.181 |
As outlined in step 6, sector and unit will be added as predictors to the model. The largest possible model (model4.0) will then be fitted, before removing the fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), while recording the AIC for each model. model4.3 appears to have the optimal (smallest) AIC (Table 25), and will be used in the next section to build the final model.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 11 | 4837.855 | 4903.793 | -2407.928 | 4815.855 | NA | NA | NA |
| 13 | 4841.696 | 4919.622 | -2407.848 | 4815.696 | 0.1593719 | 2 | 0.718 |
Figure 44: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 44 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.
The p-value of 0.718% (Table 26) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The large estimated \(p\)-value is 0.718 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the smaller model (without random slope at level three) is preferred.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|school_postcode:qcaa_school_id | 0.8530526 | 1.2182627 |
| cor_year94.(Intercept)|school_postcode:qcaa_school_id | -0.9690616 | -0.8565112 |
| sd_year94|school_postcode:qcaa_school_id | 0.0315056 | 0.0460868 |
| sd_(Intercept)|school_postcode | 0.0000009 | 0.4581153 |
| sigma | 0.4876907 | 0.5140716 |
| (Intercept) | 1.4227924 | 2.2504155 |
| year94 | 0.0005497 | 0.0184106 |
| sectorGovernment | -0.3117705 | 0.4153108 |
| sectorIndependent | -0.0628937 | 0.7950850 |
| unityear_12_enrolments | -0.2747165 | -0.1289871 |
| year94:unityear_12_enrolments | 0.0026651 | 0.0107254 |
The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level.
The 95% confidence interval is shown above, and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level. Some fixed effects such as unityear_12_enrolments were insignificant, suggesting that there were no differences between unit 11 and unit 12 units.
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year94_{tij} + \epsilon_{tij}\]
Level two (schools within postcodes) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}unit_{ij} + u_{1ij} \end{aligned}\]
Level three (postcodes) \[\begin{aligned}\beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]
Therefore, the composite model can be written as:
\[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year94_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}unit_{ij} + u_{1ij})year94_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + u_{0ij} \right] + \left[\gamma_{100} + \gamma_{110}unit_{ij} + u_{1ij}\right]year94_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{0000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{100}year94_{tij} + \gamma_{110}unit_{ij}year94_{tij} \right] + \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + u_{0ij} + u_{1ij}year94_{tij} + \epsilon_{tij} \right] \end{aligned}\] ### Fixed effects
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## school_postcode:qcaa_school_id (Intercept) 1.0717490 1.035253
## year94 0.0015528 0.039406 -0.916
## school_postcode (Intercept) 0.0812513 0.285046
## Residual 0.2507071 0.500707
## Estimate Std. Error t value
## (Intercept) 1.821947999 0.198742787 9.1673667
## year94 0.009460629 0.004666531 2.0273366
## sectorGovernment 0.048533174 0.176743531 0.2745966
## sectorIndependent 0.364311290 0.196685041 1.8522572
## unityear_12_enrolments -0.204347826 0.037573894 -5.4385587
## year94:unityear_12_enrolments 0.006683023 0.002232326 2.9937484
## Number of Level Two groups = 112
## Number of Level Three groups = 81
Notably, the model assumes that enrolments in different postcodes are assumed to increase at the same rate (as justified by the parametric bootstrap while testing for random effects). Using the model output above (see step 9 for detailed explanation on fixed effects), the estimated increase in mean enrolments for schools between postcodes are estimated to increase at a rate of 0.9506% per year.
Figure 45: Fixed effects of the final model for Agricultural Science
Based on the fixed effects, independent schools are expected to have the highest enrolments over the years. In all sectors, unit 12 have lower initial status (i.e. lower enrolments when subject is first introduced), however, they are expected to increase at a much higher rate relative to unit 11 units. Intuitively, this suggests that on average, there may be more students taking this subject in year 10, before re-enrolling again in year 12, or in the more extreme case, students are dropping the subject after completing year 11.
A clear negative correlation in the random intercept and slope can be distinguished, indicating that schools with larger enrolments when the subject was first introduced are expected to show a smaller increase (decrease) in enrolments over the years. Based on the model output, the correlation between random intercept and slope was -0.92.
Figure 46: Random intercept for districts
As there is no random slope, enrolments for the across all postcodes are estimated to be the same (justified by the parametric bootstrap). As shown in Figure 46, most of the variations are accounted for in the random intercept, which suggests that some postcodes are associated with larger schools (and enrolments) relative to other postcodes. Schools within postcode 4350 (In Toowoomba district) are predicted to have the highest initial status while schools within postcode 4570 (Wide Bay district) are predicted to have the lowest initial status.
Figure 47: Model predictions for 20 randomly selected schools
Figure 48: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Aquatic Practices
As shown previously (Figure 3), Aquatic Practices is a relatively new subject introduced in 2015. The plot above (Figure 48) displays a linear model fitted for each school (explained in step 1), primarily to provide an at-a-glance visualisation to the enrolment trends in each school.
In the randomly selected schools, the various school sizes is distinct, where there were relatively larger schools such as school 72 and 442, which showed enrolments of over 100 students while schools 570 and 579 consistently had enrolments below 50 for each year. Some schools showed a stark increase in enrolments (e.g. school 72 and 442), while some school showed rather constant growth enrolments (e.g. School 493 and 517).
As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. For these reasons, all completion years with zero enrolments will also be removed for modelling.
Figure 49: Effects of log transformation for response variable (enrolments) in Aquatic Practices
As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 49).
| df | AIC | |
|---|---|---|
| Model0.0: Within schools | 3 | 2505.504 |
| Model0.2: Schools nested within districts | 4 | 2506.955 |
| Model0.1: Schools nested within postcodes | 4 | 2507.105 |
As underlined in step 3, the three candidate models are fitted and their AIC is shown in Table 28. Based on the AIC, the two-level model (model0.0) is the best model and will be used in the subsequent analysis.
summary(model0.0)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id (Intercept) 0.69382 0.83296
## Residual 0.38336 0.61916
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.308375 0.07911121 41.8193
##
## Number of schools (level-two group) = 120
## Number of district (level-three group) = NA
This model takes into account 120 schools. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above.
The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):
\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.6938}{(0.6938 + 0.3834)} = 0.6441\]
This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 64.41% of the total variability is attributable to the differences in enrolments within schools at different time periods.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 0.9608954 0.980253
## year15 0.0069467 0.083347 -0.534
## Residual 0.1964707 0.443250
## Estimate Std. Error t value
## (Intercept) 2.5658807 0.09690281 26.47891
## year15 0.1791002 0.01069684 16.74328
## Number of Level Two groups = 120
## Number of Level Three groups = NA
The next step involves incorporating the linear growth of time into the model. The model output is shown above.
When the subject was first introduced in 2015, schools were expected to have an average of 13.0124 (\(e^{2.56588}\)) enrolments. On average, the enrolments were expected to increase by 19.614% (\((e^{0.0353} - 1) \times 100\)) per year. The estimated within-school variance decreased by 48.75% (0.3834 to 0.1965), indicating the 48.75% can be explained by the linear growth in time.
| model | npar | AIC | BIC | logLik |
|---|---|---|---|---|
| model4.5 | 12 | 1869.409 | 1929.989 | -922.7043 |
| model4.1 | 14 | 1871.986 | 1942.663 | -921.9928 |
| model4.4 | 11 | 1874.380 | 1929.912 | -926.1898 |
| model4.0 | 16 | 1874.525 | 1955.299 | -921.2624 |
| model4.7 | 13 | 1876.474 | 1942.103 | -925.2370 |
| model4.3 | 10 | 1881.588 | 1932.072 | -930.7942 |
| model4.6 | 12 | 1884.298 | 1944.879 | -930.1491 |
| model4.9 | 11 | 1888.825 | 1944.357 | -933.4124 |
| model4.10 | 11 | 1888.825 | 1944.357 | -933.4124 |
| model4.2 | 11 | 1888.825 | 1944.357 | -933.4124 |
| model4.8 | 11 | 1888.825 | 1944.357 | -933.4124 |
As summarised in step 6, level-two predictors sector and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.5 (Table 29) appears to have the optimal (smallest) AIC, and will be used in the next section in building the final model.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 10 | 1907.775 | 1958.259 | -943.8874 | 1887.775 | NA | NA | NA |
| 12 | 1869.409 | 1929.989 | -922.7043 | 1845.409 | 42.36629 | 2 | 0 |
Figure 50: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic (as indicated by the red line in Figure 50. There is overwhelming statistical evidence (\(\chi^2\) = 42.3663 and \(p\)-value = 0 – see Table 30) that the larger model (including random slope at level two) is the better model.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_school_id | 0.7694684 | 1.0336552 |
| cor_year15.(Intercept)|qcaa_school_id | -0.7092237 | -0.3083679 |
| sd_year15|qcaa_school_id | 0.0557664 | 0.0927001 |
| sigma | 0.4187891 | 0.4583143 |
| (Intercept) | 2.3796251 | 3.5004024 |
| year15 | -0.0076337 | 0.1346439 |
| sectorGovernment | -0.7578332 | 0.4846176 |
| sectorIndependent | -2.4092098 | -0.8681366 |
| unityear_12_enrolments | -0.3080638 | -0.1124896 |
| year15:sectorGovernment | 0.0376980 | 0.1846632 |
| year15:sectorIndependent | 0.0957980 | 0.2969844 |
| year15:unityear_12_enrolments | 0.0095620 | 0.0524932 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0, indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij}\]
Level two (schools within postcodes) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij} \end{aligned}\]
Therefore, the composite model can be written as
\[\begin{aligned}Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year15_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12j}unit_{ij} + u_{1ij})year15_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + \beta_{10j}year15_{tij} + \beta_{11j}sector_{ij}year15_{tij} + \beta_{12j}unit_{ij}year15_{tij} \right] \left[u_{0ij} + u_{1ij} + \epsilon_{tij} \right] \end{aligned}\]
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 0.802676 0.895922
## year15 0.005632 0.075047 -0.537
## Residual 0.192796 0.439085
## Estimate Std. Error t value
## (Intercept) 2.96690311 0.28548780 10.3923990
## year15 0.05859201 0.03459941 1.6934394
## sectorGovernment -0.14916989 0.30209047 -0.4937921
## sectorIndependent -1.64252982 0.39342178 -4.1749844
## unityear_12_enrolments -0.20422624 0.05216529 -3.9149833
## year15:sectorGovernment 0.11332298 0.03601076 3.1469199
## year15:sectorIndependent 0.19500364 0.04787424 4.0732474
## year15:unityear_12_enrolments 0.03002428 0.01136349 2.6421712
## Number of Level Two groups = 120
## Number of Level Three groups = NA
Based on the model output, the estimated mean enrolments for government schools are estimated to be 13.8577% (\((e^{-0.1491699} - 1) \times 100\)) less than that of catholic schools when the subject was first introduced in 2015. Government schools are also expected to have an average increase of 18.7577% (\((e^{0.0585920 + 0.1133230} - 1) \times 100\)), 11.9994% \(((e^{0.1133230} - 1) \times 100)\) more than that of catholic schools per year, .
On the other hand, independent schools are estimated to have an average enrolments of 80.651% (\((e^{-1.6425298} - 1) \times 100\)) less than that of catholic schools. However, this small initial status is matched with a 28.8651% \((e^{.0585920 + 0.1950036} - 1) \times 100\) increase in enrolments per year, on average. This increase is 21.5315% (\((e^{0.1950036} - 1) \times 100\)) greater than that of catholic schools.
Figure 51: Fixed effects of the final model for Aquatic Practices subject
The fixed effects can be visualised in Figure 51. As mentioned, catholic schools had the highest enrolments score in 2015, but had a slow increase in enrolments over the years; By 2018, government schools had greater enrolments, on average than catholic schools, and it is expected that independent schools have larger enrolments numbers, on average, than catholic schools. In all sectors, unit 12 appears to increase at a higher rate than that of unit 11, as shown by the steeper slope.
In the random effects, there is a some negative correlation between the random intercept and slope, where schools with lower enrolments when subject was first introduced are matched with a higher increase (decrease) in enrolments over the years. However, this negative relationship was not relatively strong, where the correlation between the random intercept and slope is only at -0.54, as shown in the model output.
Figure 52: Model predictions for year 11 enrolments for 20 randomly selected schools
Figure 52 above shows the predictions for 20 randomly selected schools.
Figure 53: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Biology subject
As described in step 1, a basic linear model was plotted for each school to provide insights of the enrolment trends for each school. Figure 53 demonstrates Biology may be introduced in schools in later years (e.g. school 235) and schools may have removed the subject and in some cases, school 1 and school 645 did not exist after 1997. School 570 showed a halt in the subject from 2005 to 2014. Schools can also vary greatly in enrolment sizes, for instance, schools 129, 132 and 570 had less than 50 enrolments for every cohort while some schools (e.g. school 235) have relatively larger cohort size. Some schools also showed a general decrease in enrolments across the years while some schools such at school 5 and 470 showed a significant increase in enrolments over the years.
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year.
Figure 54: Effects of log transformation for response variable (enrolments) in Biology subject
A log transformation was used on the response variable (enrolments) to allow model to be estimated by the multilevel model, which assumes normality in the error terms. As shown in Figure 54.
| df | AIC | |
|---|---|---|
| Model0.2: Schools nested within districts | 4 | 30247.41 |
| Model0.1: Schools nested within postcodes | 4 | 30284.86 |
| Model0.0: Within schools | 3 | 30290.46 |
Referring back to step 3, three candidate models are fitted, with the AIC shown in Table 32. Model0.2, corresponding to having schools nested within districts is the best model, with optimised (lowest) AIC and will be used in the subsequent analysis.
summary(model0.2)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id:qcaa_district (Intercept) 0.61534 0.78444
## qcaa_district (Intercept) 0.11133 0.33366
## Residual 0.22037 0.46943
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.271578 0.09968013 32.82077
##
## Number of schools (level-two group) = 468
## Number of district (level-three group) = 13
In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:
The level-two ICC relates to the correlation between school \(i\) from a certain district \(k\) in time \(t\) and in time \(t^*\):
\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.6153}{(0.6153 + 0.1113 + 0.2204)} = 0.6497\]
This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 64.97% of the total variability is attributable to the differences between schools from the same district rather than changes over time within schools.
The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific school \(j\).
\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1113}{(0.6153 + 0.1113 + 0.1113)} = 0.1175\]
Similarly, it can be inferred that the correlation between enrolments of two randomly selected schools from different districts are 11.75%, where the total variability can be attributed to the difference between districts.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 2.1466e+00 1.4651162
## year92 2.3778e-03 0.0487625 -0.805
## qcaa_district (Intercept) 6.5640e-02 0.2562035
## year92 6.0662e-05 0.0077886 0.409
## Residual 1.5031e-01 0.3876979
## Estimate Std. Error t value
## (Intercept) 2.68260254 0.101014731 26.556548
## year92 0.02593676 0.003253587 7.971744
## Number of Level Two groups = 468
## Number of Level Three groups = 13
The unconditional growth model adds the systematic changes over time, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to the linear changes over time. Based on the model output:
Biology was first introduced in 1992, and schools are expected to have 14.62 (\(e^{2.68260}\)), on average. Furthermore, enrolments were expected to increase by 2.628% (\(e^{0.02594} - 1) \times 100\)) every year. The estimated within-school variance decrease by 24.49% (0.2204 to 0.1503), implying that 24.49% of the within-school variability can be explained by the linear growth over time.
| model | npar | AIC | BIC | logLik |
|---|---|---|---|---|
| model4.1 | 17 | 23783.03 | 23918.34 | -11874.51 |
| model4.0 | 19 | 23784.32 | 23935.55 | -11873.16 |
| model4.7 | 16 | 23784.71 | 23912.06 | -11876.35 |
| model4.5 | 15 | 23789.67 | 23909.07 | -11879.84 |
| model4.4 | 14 | 23791.07 | 23902.50 | -11881.53 |
| model4.9 | 14 | 23853.67 | 23965.11 | -11912.84 |
| model4.10 | 14 | 23853.67 | 23965.11 | -11912.84 |
| model4.2 | 14 | 23853.67 | 23965.11 | -11912.84 |
| model4.8 | 14 | 23853.67 | 23965.11 | -11912.84 |
| model4.3 | 13 | 23858.62 | 23962.10 | -11916.31 |
| model4.6 | 15 | 23859.14 | 23978.53 | -11914.57 |
As detailed in step 6, level-two predictors (sector and unit) are added to the model. The largest possible model will be fitted, before removing each fixed effect one by one whilst recording the AIC for each model. model4.0 corresponds to the largest model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.1, and will be used in subsequent sections.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 15 | 23781.72 | 23901.12 | -11875.86 | 23751.72 | NA | NA | NA |
| 17 | 23783.03 | 23918.35 | -11874.51 | 23749.03 | 2.694776 | 2 | 0.126 |
Figure 55: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 55 displays the likelihood ratio test statistic from the null distribution, with the red line representing the likelihood ratio test statistic using the actual data. The p-value of 14.5% (Table 34) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The large estimated \(p\)-value is 0.145 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the smaller model (without random slope at level three) is preferred.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|school_postcode:qcaa_school_id | 0.8530526 | 1.2182627 |
| cor_year94.(Intercept)|school_postcode:qcaa_school_id | -0.9690616 | -0.8565112 |
| sd_year94|school_postcode:qcaa_school_id | 0.0315056 | 0.0460868 |
| sd_(Intercept)|school_postcode | 0.0000009 | 0.4581153 |
| sigma | 0.4876907 | 0.5140716 |
| (Intercept) | 1.4227924 | 2.2504155 |
| year94 | 0.0005497 | 0.0184106 |
| sectorGovernment | -0.3117705 | 0.4153108 |
| sectorIndependent | -0.0628937 | 0.7950850 |
| unityear_12_enrolments | -0.2747165 | -0.1289871 |
| year94:unityear_12_enrolments | 0.0026651 | 0.0107254 |
The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level.
The 95% confidence interval is shown above (Table 35), and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level. Some fixed effects such as unityear_12_enrolments were insignificant, suggesting that there were no differences between unit 11 and unit 12 units.
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]
Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]
Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{03j} &= \gamma_{030} + r_{03j} \\ \beta_{10j} &= \gamma_{100} \\ \beta_{11j} &= \gamma_{110} \end{aligned}\]
Therefore, the composite model can be written as
\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ &(\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[(\gamma_{000} + r_{00j}) + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + \gamma_{110}sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{020}sector_{ij}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}year92_{tij}sector_{ij} \right] + \\& \left[ r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + u_{1ij} \epsilon_{tij}\right] \end{aligned}\]
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 1.8582181 1.363165
## year92 0.0020144 0.044882 -0.777
## qcaa_district (Intercept) 0.1468611 0.383225
## Residual 0.1493206 0.386420
## Estimate Std. Error t value
## (Intercept) 2.967847727 0.1906944799 15.5633646
## year92 0.026446106 0.0053039215 4.9861420
## sectorGovernment 0.170207789 0.1829943554 0.9301259
## sectorIndependent -1.266171057 0.2020482537 -6.2666766
## unityear_12_enrolments -0.003237843 0.0157576995 -0.2054769
## year92:sectorGovernment -0.016269000 0.0061142842 -2.6608184
## year92:sectorIndependent 0.031103798 0.0067695649 4.5946524
## year92:unityear_12_enrolments -0.001164564 0.0006060924 -1.9214299
## sectorGovernment:unityear_12_enrolments -0.046618866 0.0143680795 -3.2446136
## sectorIndependent:unityear_12_enrolments -0.039104708 0.0162432846 -2.4074384
## Number of Level Two groups = 468
## Number of Level Three groups = 13
Using the model output above (see step 9 for detailed explanation on fixed effects), the estimated increase in mean enrolments for government schools is 1.0229% (\(e^{0.0264 - 0.0162} - 1) \times 100\)), which is 1.6402% (\((e^{0.01626900} - 1) \times 100\)) less than that of catholic schools. On the other hand, the mean enrolments for independent schools are estimated to increase by 5.9238% (\(e^{0.0264 - 0.0162} - 1) \times 100\)) each year, which is 3.1592% more than catholic schools.
Figure 56: Fixed effects of the final model for Biology subject
The fixed effects can be better visualised in Figure 56, independent schools appears to have the highest average increase in enrolments per year. It appears that after 2022, enrolments in independent schools are predicted to be higher than government schools, on average. Unit 11 enrolments only appears to be marginally smaller than unit 12 enrolments for government and independent schools, but appears to be the same for catholic schools. Government schools starts of (in 1992) with the highest enrolments on average, but it is matched with a slow increase in enrolments over the years.
Figure 57: Random effects for schools
Figure 57 displays the random effects for a given school. It is apparent that the random intercepts and slopes are negatively correlated, where a large intercept is associated with a smaller random slope. This indicates that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have larger increase in enrolments over the years.
Figure 58: Random intercept for districts
As the random slopes are removed, all districts are predicted to have the same increase in enrolments over the years; And as was discussed previously, this was a reasonable assumption or an otherwise perfect correlation with random slope and intercept will be fitted. Figure 58 demonstrates that schools in Brisbane Central has the largest enrolments while Mackay have the lowest enrolments in Biology subject, on average.
Figure 59: Model predictions for 20 randomly selected schools
Figure 59 above shows the predictions for 20 randomly selected schools.
Figure 60: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Chemistry subject
With reference to the first step, Figure 60 fits a linear model for enrolments for a random sample of 20 schools. The various school sizes is apparent; To illustrate, school 631 and 647 had enrolments of less than 20 each year, while schools such as schools 90 and 555 consistently showed enrolments of approximately more than 50 students each year. Some schools discontinued the subjects (e.g. school 96) while other schools only introduced the subject in the later years (e.g. school 315).
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.
Figure 61: Effects of log transformation for response variable (enrolments) in Chemistry subject
As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 61).
| df | AIC | |
|---|---|---|
| Model0.2: Schools nested within districts | 4 | 28379.04 |
| Model0.1: Schools nested within postcodes | 4 | 28401.11 |
| Model0.0: Within schools | 3 | 28422.33 |
As per step 3, the three potential models are fitted, with the AIC shown in Table 36. Based on the AIC, model0.2, corresponding the schools nested within districts is the best model and will be used in the subsequent analysis.
summary(model0.2)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id:qcaa_district (Intercept) 0.58885 0.76736
## qcaa_district (Intercept) 0.10767 0.32813
## Residual 0.20940 0.45760
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.883525 0.09814118 29.3814
##
## Number of schools (level-two group) = 454
## Number of district (level-three group) = 13
This model takes into account 454 schools nested in 13 districts. In a three-level multilevel model, two intraclass correlations can be obtained using the model summary output above:
The level-two ICC relates to the correlation between school \(i\) from district \(k\) in time \(t\) and in time \(t^* \ne t\):
\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.5888}{(0.5888 + 0.1077 + 0.2094)} = 0.6450\]
This can be conceptualised as the correlation between enrolments of two random draws from the same school at two different years. In other words, 64.50% of the total variability is attributable to the changes overtime within schools.
The level-three ICC refers to the correlation between different schools \(i\) and \(i^*\) from a specific district \(j\) in time \(t\) and time \(t^* \ne t\).
\[\text{Level-three ICC} = \frac{\phi^2_{00}}{\tau^2_{00} + \phi^2_{00} + \sigma^2} = \frac{0.1077}{(0.5888 + 0.1077 + 0.2094)} = 0.1189\] Similarly, this can be conceptualised as the correlation between enrolments of two randomly selected schools from the same district – i.e. 11.70% of the total variability is due to the difference between districts.
The unconditional growth model introduces the time predictor at level one, the model specification can be found in step 4. This allows for assessing within-school variability which can be attributed to linear changes over time. Furthermore, variability in intercepts and slopes can be obtained to compare schools within the same districts, and schools from different districts.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 1.7739e+00 1.3318683
## year92 1.8155e-03 0.0426081 -0.790
## qcaa_district (Intercept) 6.1621e-02 0.2482367
## year92 6.1531e-05 0.0078442 0.512
## Residual 1.4798e-01 0.3846785
## Estimate Std. Error t value
## (Intercept) 2.40033451 0.095359577 25.171405
## year92 0.02165636 0.003068893 7.056736
## Number of Level Two groups = 454
## Number of Level Three groups = 13
When the subject was first introduced in 1992, schools were expected to have 11.0265 (\(e^{1.7848}\)) enrolments, on average. Furthermore, on average, enrolments were expected to increase by 2.1893% (\((e^{0.0216564} - 1) \times 100\)) per year. The estimated within-schools variance decreased by 29.3326% (0.2094 to 0.14797754), implying that 29.3326% of within-school variability can be explained by the linear growth over time.
| model | AIC |
|---|---|
| model4.7 | 22651.34 |
| model4.1 | 22651.36 |
| model4.4 | 22653.66 |
| model4.5 | 22654.01 |
| model4.0 | 22656.29 |
| model4.2 | 22718.07 |
| model4.8 | 22718.07 |
| model4.9 | 22718.07 |
| model4.10 | 22718.07 |
| model4.6 | 22718.07 |
| model4.3 | 22720.78 |
As highlighted in step 6, sector and unit will be added as predictors to the model. The largest possible model will be fitted, before removing fixed effects one by one while recording the AIC for each model. In this case, model4.0 corresponds to the largest possible model while model4.10 is the smallest possible model. The model with the optimal (lowest) AIC is model4.4 (Table 37). The next section will test the selected model’s random effects to build the final model.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 14 | 22651.66 | 22762.71 | -11311.83 | 22623.66 | NA | NA | NA |
| 16 | 22651.34 | 22778.26 | -11309.67 | 22619.34 | 4.311858 | 2 | 0.04 |
Figure 62: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. Figure 62 displays the likelihood ratio test statistic from the null distribution, with the red line representing the likelihood ratio test statistic using the actual data. The p-value of 0.04% (Table 38) indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. The estimated \(p\)-value is 0.04 < 0.05 fails to reject the null hypothesis at the 5% level, indicating that the larger model (without random slope at level three) is preferred.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_district:qcaa_school_id | 1.1802941 | 1.3535015 |
| cor_year92.(Intercept)|qcaa_district:qcaa_school_id | -0.8005691 | -0.7167320 |
| sd_year92|qcaa_district:qcaa_school_id | 0.0362919 | 0.0424250 |
| sd_(Intercept)|qcaa_district | 0.0854839 | 0.4588320 |
| cor_year92.(Intercept)|qcaa_district | -0.6818153 | 1.0000000 |
| sd_year92|qcaa_district | 0.0006963 | 0.0113808 |
| sigma | 0.3792006 | 0.3871578 |
| (Intercept) | 2.3667074 | 3.0418792 |
| year92 | 0.0071077 | 0.0269531 |
| sectorGovernment | -0.2661168 | 0.3889485 |
| sectorIndependent | -1.4463471 | -0.6984020 |
| unityear_12_enrolments | -0.0646203 | -0.0173266 |
| year92:sectorGovernment | -0.0209668 | 0.0013685 |
| year92:sectorIndependent | 0.0189922 | 0.0427861 |
| sectorGovernment:unityear_12_enrolments | -0.0624251 | -0.0069597 |
| sectorIndependent:unityear_12_enrolments | -0.0527715 | 0.0100726 |
The parametric bootstrap is utilised to construct confidence intervals (as detailed in step 8). If the confidence intervals for the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level.
The 95% confidence interval is shown above (Table 39), and the random effects all exclude 0, further reiterating that they are statistically significant at the 5% level.
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]
Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]
Level three (districts) \[\begin{aligned} \beta_{00j} = \gamma_{000} + r_{00j} \\ \beta_{01j} = \gamma_{010} + r_{01j} \\ \beta_{02j} = \gamma_{020} + r_{02j} \\ \beta_{03j} = \gamma_{030} + r_{03j} \\ \beta_{10j} = \gamma_{100} + r_{10j} \\ \beta_{11j} = \gamma_{110} + r_{11j} \end{aligned}\]
Therefore, the composite model can be written as
\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ &(\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[(\gamma_{000} + r_{00j}) + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + (\gamma_{030} + r_{03j})sector_{ij}unit_{ij} + u_{0ij} \right] + \\ & \left[(\gamma_{100} + r_{10j}) + (\gamma_{110} + r_{11j})sector_{ij} + u_{1ij} \right]year92_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{020}sector_{ij}unit_{ij} + \gamma_{100}year92_{tij} + \gamma_{110}year92_{tij}sector_{ij} \right] + \\ & \left[ r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{03j}sector_{ij}unit_{ij} + u_{0ij} + r_{10j}year92_{tij} + r_{11j}year92_{tij}sector_{ij} + u_{1ij} \epsilon_{tij} \right] \end{aligned}\]
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## qcaa_district:qcaa_school_id (Intercept) 1.6097e+00 1.2687521
## year92 1.5320e-03 0.0391402 -0.765
## qcaa_district (Intercept) 7.7313e-02 0.2780516
## year92 3.3268e-05 0.0057678 0.621
## Residual 1.4672e-01 0.3830370
## Estimate Std. Error t value
## (Intercept) 2.69358309 0.165949162 16.2313751
## year92 0.01832166 0.004966819 3.6888110
## sectorGovernment 0.06422409 0.170475913 0.3767341
## sectorIndependent -1.04537987 0.188584541 -5.5432957
## unityear_12_enrolments -0.04196115 0.012373547 -3.3911980
## year92:sectorGovernment -0.01003111 0.005433864 -1.8460368
## year92:sectorIndependent 0.03050230 0.006036915 5.0526299
## sectorGovernment:unityear_12_enrolments -0.03559772 0.014308159 -2.4879313
## sectorIndependent:unityear_12_enrolments -0.02235697 0.016211174 -1.3791088
## Number of Level Two groups = 454
## Number of Level Three groups = 13
Using the model output above (see step 9 for detailed explanation on fixed effects), the estimated increase in mean enrolments for government schools is 0.832496% (\((e^{0.0183 - 0.0100} - 1) \times 100\)), which is 1.0082% (\((e^{0.0100} - 1) \times 100\)) less than that of catholic schools. On the other hand, the mean enrolments for independent schools are estimated to increase by 5.0036% (\(e^{0.01832 + 0.03050} - 1) \times 100\)) each year, which is 3.0972% (\((e^{0.0305022}-1) \times 100\)) more than catholic schools.
Figure 63: Fixed effects of the final model for Chemistry
The fixed effects are better accentuated in Figure 63. As aforementioned, independent school have the lowest initial status, that is, the least enrolments when the subject was first introduced. However, this low enrolments was matched with a relatively high slope compared to the other sectors. From 2021 onwards, independent schools are expected to have greater enrolments than government schools, on average. Government schools have the highest initial status, but are expected to have the smallest increase in enrolments over the years (as demonstrated by the gentle slope).
Figure 64: Random effects for schools
Figure 64 displays the random effects for a given school. It is apparent that the random intercepts and slopes are negatively correlated, where a large intercept is associated with a smaller random slope (csorrelation = -0.76, as shown in the model output). This indicates that a larger school is associated with a smaller increase (decrease) in enrolments over the years while smaller schools are predicted to have larger increase in enrolments over the years.
Figure 65: Random intercept for districts
The model includes a random slope at level three, and the effects of the random intercept and slope in level three is shown in Figure 65. As shown in the model output, correlation between the random intercept and slope is 0.62. Loosely speaking, this suggests that districts with large enrolments are going to be larger, while smaller districts are going to increase (decrease) at a slower rate. Based on Figure 65, Gold Coast are estimated to have the highest slope, indicating that the rate of change in enrolment is the greatest compared to the other districts.
Figure 66: Model predictions for 20 randomly selected schools
Figure 66 above shows the predictions for 20 randomly selected schools.
Figure 67: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Earth and Environmental Science subject
Figure 67 fits a linear model for 20 randomly selected schools. The difference in enrolment numbers across the cohorts between schools is apparent, for example, school 405 and 532 which consistently showed little enrolments (< 20) per year while schools such as school 175 and 618 demonstrates significantly higher enrolment numbers in a single cohort.
Some schools offered the subject in little years, such as school 233 and 588, which only offered the subject in the new QCE system. There were also varying patterns in enrolment trends where school 618 showed a large increase in enrolments over the years while school 111 appears to have a stark decrease in enrolments since it offered the subject. In some cases, there were only a few students enrolments in the school, such as school 532, which only showed 1 enrolment when the school offered the subject.
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.
Figure 68: Effects of log transformation for response variable (enrolments) in Earth and Environmental Science subject
The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 67). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.
| df | AIC | |
|---|---|---|
| Model0.0: Within schools | 3 | 2243.65 |
| Model0.1: Schools nested within postcodes | 4 | 2244.80 |
| Model0.2: Schools nested within districts | 4 | 2245.65 |
As outlined in step 3, the three candidate models are fitted and their AIC is shown in Table 40. Based on the AIC, the two-level model (model0.0) is the superior model and will be used in the subsequent analysis.
summary(model0.0)
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id (Intercept) 1.51734 1.2318
## Residual 0.29507 0.5432
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.06781 0.1325348 15.60202
##
## Number of schools (level-two group) = 92
## Number of district (level-three group) = NA
This model takes into account 92. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above. The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):
\[\text{Level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{1.5173}{(1.5173 + 0.2951)} = 0.8372\] This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 83.72% of the total variability is attributable to the differences in enrolments within schools at different time periods.
summary(model1.0)
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 2.0601140 1.435310
## year92 0.0028797 0.053663 -0.847
## Residual 0.2458310 0.495813
## Estimate Std. Error t value
## (Intercept) 1.63148417 0.176855363 9.224963
## year92 0.03527984 0.007508286 4.698787
## Number of Level Two groups = 92
## Number of Level Three groups = NA
The next step involves incorporating the linear growth of time into the model. The model output is shown above.
When the subject was first introduced in 1992, schools were expected to have an average of 5.1115 (\(e^{1.6314842}\)) enrolments, which is a relatively low number as compared to the other mathematics and science subjects. Furthermore, Figure 2 demonstrated that the number of schools offering the subject in 1992 was the highest among all years.
On average, the enrolments were expected to increase by 3.59304% (\((e^{0.0353} - 1) \times 100\)) per year. The estimated within-school variance decreased by 16.70% (0.2951 to 0.2458), indicating the 16.70% can be explained by the linear growth in time.
| model | npar | AIC | BIC | logLik |
|---|---|---|---|---|
| model4.4 | 11 | 2083.856 | 2139.727 | -1030.928 |
| model4.5 | 12 | 2085.205 | 2146.155 | -1030.603 |
| model4.7 | 13 | 2086.236 | 2152.265 | -1030.118 |
| model4.3 | 10 | 2086.805 | 2137.596 | -1033.402 |
| model4.1 | 14 | 2087.644 | 2158.753 | -1029.822 |
| model4.10 | 11 | 2087.848 | 2143.719 | -1032.924 |
| model4.9 | 11 | 2087.848 | 2143.719 | -1032.924 |
| model4.2 | 11 | 2087.848 | 2143.719 | -1032.924 |
| model4.8 | 11 | 2087.848 | 2143.719 | -1032.924 |
| model4.6 | 12 | 2089.241 | 2150.191 | -1032.621 |
| model4.0 | 16 | 2091.481 | 2172.748 | -1029.740 |
As summarise in step 6, level-two predictors secotr and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.4 appears to have the optimal (smallest) AIC (Table 41), and will be used in the next section in building the final model.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 9 | 2223.393 | 2269.105 | -1102.696 | 2205.393 | NA | NA | NA |
| 11 | 2083.856 | 2139.727 | -1030.928 | 2061.856 | 143.537 | 2 | 0 |
Figure 69: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. Figure 69 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.
There is overwhelming statistical evidence (\(\chi^2\) = 143.537 and \(p\)-value = 0 from Table 42) that the larger model (including random slope at level two) is the better model.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_school_id | 1.0791392 | 1.5883744 |
| cor_year92.(Intercept)|qcaa_school_id | -0.9079310 | -0.6751860 |
| sd_year92|qcaa_school_id | 0.0393192 | 0.0678080 |
| sigma | 0.4746765 | 0.5167394 |
| (Intercept) | 0.5979313 | 2.9539799 |
| year92 | -0.0293459 | 0.0774094 |
| sectorGovernment | -1.6512267 | 0.7654638 |
| sectorIndependent | -0.4903149 | 2.2137154 |
| unityear_12_enrolments | -0.0447976 | 0.0733969 |
| year92:sectorGovernment | -0.0366734 | 0.0751267 |
| year92:sectorIndependent | -0.0819655 | 0.0454768 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 43), indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]
Level two (schools within districts) will contain new predictor(sector)
\[\begin{aligned} \pi_{0ij} = \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} = \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]
The composite model can therefore be written as: \[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year92_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02}unit_{ij} + \beta_{10j}year92_{tij} + \beta_{11j}sector_{ij}year92_{tij} \right] + \left[u_{0ij} + u_{1ij} + \epsilon_{tij} \right] \end{aligned}\]
summary(model_f)
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 1.8069480 1.344228
## year92 0.0028392 0.053284 -0.824
## Residual 0.2457532 0.495735
## Estimate Std. Error t value
## (Intercept) 1.70508906 0.63997319 2.6643133
## year92 0.02482993 0.02880730 0.8619320
## sectorGovernment -0.35142132 0.66866804 -0.5255542
## sectorIndependent 0.96758390 0.74465152 1.2993781
## unityear_12_enrolments 0.01447581 0.02919046 0.4959088
## year92:sectorGovernment 0.01972213 0.03014928 0.6541493
## year92:sectorIndependent -0.02215715 0.03284658 -0.6745648
## Number of Level Two groups = 92
## Number of Level Three groups = NA
Based on the model output, the estimated mean enrolments for government schools are estimated to be 29.63% (\((e^{0.3514213} - 1) \times 100\)) less than that of catholic schools when the subject was first introduced in 1992. However, government schools are estimated to have a mean increase of 4.5560% (\((e^{0.0248299 + 0.0197221} - 1) * \times 100\)) per year, which is 1.9918% (\((e^{0.0197221} - 1) * \times 100\)) greater than the increase in enrolments in catholic schools.
Independent schools are estimated to have a 163.158% (\((e^{0.9675839} - 1) \times 100\)) more than that of catholic schools in 1992. However, independent schools showed a slow increase in enrolments (0.26786%) over per year, on average. This increase in enrolments is -2.1913% (\((exp^{-0.0221572} - 1) * 100\)) less than that of catholic schools.
Figure 70: Fixed effects of the final model for Agricultural Practices subject
The results can be better visualised in Figure 70. On average Government schools started off with little enrolments, but showed a stark increase in enrolments over the years. In contrast, Independent schools have relatively high enrolments when the subject was first introduced in 1992, but showed little increase over the years.
Figure 71: Random effects for all schools
Figure 71 shows the random effects for all 92 schools that offered the subject. There is a clear negative correlation between the random intercept and the random slope, which indicates that in general, schools with lesser enrolments are generally matched with a larger increase in enrolments over the years.
Figure 72: Model predictions for year 11 enrolments for 20 randomly selected schools
Figure 72 above shows the predictions for 20 randomly selected schools.
Figure 73: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Marine Science
Figure 73 shows a basic linear plot for 20 randomly selected schools. The subject was introduced in 2014, and some school offered the subjects at later years while some schools discontinued the subject (e.g. school 373). Next, some schools (e.g. school 220 and 233) showed a large increase in enrolments relative to the other schools while some schools showed a decrease in enrolments over the years. The various school sizes can be seen, where some schools have less than 25 enrolments each year while some school have over 50 enrolments per cohort.
As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. For these reasons, all completion years with zero enrolments will also be removed for modelling.
Figure 74: Effects of log transformation for response variable (enrolments) in Marine Science
The enrolments were right skewed, which is likely to be attributed to the various school sizes (as seen in Figure 74). A log transformation was implemented to the response variable (i.e. enrolments) to allow the the multilevel model to better capture the enrolment patterns.
| df | AIC | |
|---|---|---|
| Model0.0: Within schools | 3 | 1261.849 |
| Model0.2: Schools nested within districts | 4 | 1262.921 |
| Model0.1: Schools nested within postcodes | 4 | 1263.849 |
As underlined in Step 3, the three candidate models are fitted and their AIC is shown in Table 44. Based on the AIC, the two-level model (model0.0) is the best model and will be used in the subsequent analysis.
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id (Intercept) 0.44111 0.66416
## Residual 0.21735 0.46620
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.955045 0.07678718 38.48358
##
## Number of schools (level-two group) = 81
## Number of district (level-three group) = NA
This model takes into account 81 schools which offer the subject. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above.
The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):
\[ \text{level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.4411}{(0.4411 + 0.2173)} = 0.6700\]
This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 67% of the total variability is attributable to the differences in enrolments within schools at different time periods.
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 0.4715552 0.686699
## year15 0.0071891 0.084788 -0.284
## Residual 0.1628130 0.403501
## Estimate Std. Error t value
## (Intercept) 2.7409337 0.08744009 31.34642
## year15 0.0464796 0.01279031 3.63397
## Number of Level Two groups = 81
## Number of Level Three groups = NA
The next step involves incorporating the linear growth of time into the model. The model output is shown above.
When the subject was first introduced in 2015, schools were expected to have an average of 15.5009 (\(e^{2.7409}\)) enrolments. On average, the enrolments were expected to increase by 4.7598% (\((e^{0.0465} - 1) \times 100\)) per year. The estimated within-school variance decreased by 25.07% (0.2173 to 0.162813), indicating the 25.07% can be explained by the linear growth in time.
| model | AIC |
|---|---|
| model4.4 | 1127.338 |
| model4.2 | 1128.333 |
| model4.8 | 1128.333 |
| model4.10 | 1128.333 |
| model4.9 | 1128.333 |
| model4.3 | 1128.400 |
| model4.7 | 1128.990 |
| model4.5 | 1129.327 |
| model4.6 | 1130.333 |
| model4.1 | 1130.989 |
| model4.0 | 1134.839 |
As summarise in step 6, level-two predictors secotr and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.4 (Table 45) appears to have the optimal (smallest) AIC, and will be used in the next section in building the final model.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 9 | 1177.905 | 1219.838 | -579.9523 | 1159.905 | NA | NA | NA |
| 11 | 1127.338 | 1178.590 | -552.6690 | 1105.338 | 54.56667 | 2 | 0 |
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic (as indicated by the red line in Figure ??. There is overwhelming statistical evidence (\(\chi^2\) = 54.5667 and \(p\)-value = 0 – see Table 46) that the larger model (including random slope at level two) is the better model.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_school_id | 0.5516981 | 0.8199168 |
| cor_year15.(Intercept)|qcaa_school_id | -0.6355994 | -0.0908638 |
| sd_year15|qcaa_school_id | 0.0604489 | 0.1065415 |
| sigma | 0.3809136 | 0.4236035 |
| (Intercept) | 2.4156300 | 3.2727234 |
| year15 | -0.0301446 | 0.0911383 |
| sectorGovernment | -0.4283699 | 0.5614812 |
| sectorIndependent | -0.9822681 | 0.1326594 |
| unityear_12_enrolments | -0.1255390 | -0.0104000 |
| year15:sectorGovernment | -0.0252647 | 0.1139009 |
| year15:sectorIndependent | -0.0999005 | 0.0667181 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 47), indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij}\]
Level two (schools within districts) \[\begin{aligned} \pi_{0ij} = \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij} \\ \pi_{1ij} = \beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij} \end{aligned}\]
Level three (districts) \[\begin{aligned} \beta_{00j} &= \gamma_{000} + r_{00j} \\ \beta_{01j} &= \gamma_{010} + r_{01j} \\ \beta_{02j} &= \gamma_{020} + r_{02j} \\ \beta_{10j} &= \gamma_{100} + r_{10j} \\ \beta_{11j} &= \gamma_{110} + r_{11j} \end{aligned}\]
Therefore, the composite model can be written as
\[\begin{aligned} Y_{tij} =\ &\pi_{0ij} + \pi_{1ij}year92_{tij} + \epsilon_{tij} \\ =\ &(\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + u_{0ij}) + (\beta_{10j} + \beta_{11j}sector_{ij} + u_{1ij})year15_{tij} +\epsilon_{tij} \\ =\ & \left[\gamma_{000} + r_{00j} + (\gamma_{010} + r_{01j})sector_{ij} + (\gamma_{020} + r_{02j})unit_{ij} + u_{0ij} \right] + \\ & \left[\gamma_{100} + r_{10j} + (\gamma_{110} + r_{11j})sector_{ij}\right]year15_{tij} + \epsilon_{tij} \\ =\ & \left[\gamma_{000} + \gamma_{010}sector_{ij} + \gamma_{020}unit_{ij} + \gamma_{100}year15_{tij} + \gamma_{110}sector_{ij}year15_{tij} \right] + \\ & \left[r_{00j} + r_{01j}sector_{ij} + r_{02j}unit_{ij} + r_{10j}year15_{tij} + r_{11j}sector_{ij}year15_{tij} + u_{0ij} + \epsilon_{tij} \right] \end{aligned}\]
## Random effects:
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 0.4666221 0.683097
## year15 0.0069507 0.083371 -0.425
## Residual 0.1615822 0.401973
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.84400284 0.21194908 13.4183310
## year15 0.02927534 0.03094077 0.9461739
## sectorGovernment 0.03976737 0.23892386 0.1664437
## sectorIndependent -0.44995296 0.28698959 -1.5678372
## unityear_12_enrolments -0.06749144 0.02913496 -2.3165101
## year15:sectorGovernment 0.03977166 0.03469870 1.1462002
## year15:sectorIndependent -0.01081485 0.04262658 -0.2537113
##
## Number of schools (level-two group) = 81
## Number of district (level-three group) = NA
Based on the model output (see detailed explanation of fixed effects in step 9), the estimated mean enrolments for government schools are estimated to increase by 7.1487% (\((e^{0.0292753+0.0397717} - 1) \times 100\)) each year, which is 4.0573% (\((e^{0.0397717} - 1) \times 100)\) more than that of catholic schools.
On the other hand, independent schools are estimated to have a 1.8632% (\((e^{0.0292753 -0.0108148} - 1) \times 100\)) increase in enrolments per year, on average. This increase is 1.0757% (\((e^{-0.0108148}) - 1) \times 100\)) less than that of catholic schools.
Figure 75: Fixed effects of the final model for Marine Science subject
The fixed effects are better accentuated in Figure 75. This figure shows that government schools are expected to have the largest increase in enrolments over the years, on average. For all sectors, it appears that unit 12 enrolments are consistently less than that of unit 11 enrolments, which may suggests that students are not pursuing year 12 after completing year 11 syllabus.
Figure 76: Random effects for all schools
Figure 76 represents the random intercept and slope for the random effects for a given school. It is manifest that there are no clear relationship between the random intercept and slope. Some schools with low enrolments when the subject was first introduced the school saw a large increase in enrolments over the years, while some showed a sharp decrease.
Figure 77: Model predictions for 20 randomly selected schools
Figure 77 above shows the predictions for 20 randomly selected schools.
Figure 78: Basic linear model for 20 randomly selected schools to provide an at-a-glance visualisation of enrolment trends within schools for Specialist Mathematics subject
With reference to the first step, Figure 78 fits a linear model for enrolments for a random sample of 20 schools to provide insights on the enrolments trends in Science in Practice. As demonstrated, cohort size for each school vary, with smaller cohorts in schools 517, 530, 655 (bottom) which has enrolment numbers of approximately less than 50 each year; and larger schools such as school 416, which has more than 200 enrolments in a single cohort by the end of 2021. Some schools showed a steep increase in enrolments (e.g. school 416 and 501), while some schools showed rather constant or decreasing enrolments trends (e.g. school 655).
All zero enrolments in a given year will be removed for modelling. As aforementioned, most of the zero enrolments in year 11 (refer to Figure 9) were attributed to the 2007 prep year cohort while zero enrolments in year 12 relates to the first year in which a school introduces the subject. Other zero enrolments mostly relates to smaller schools with little to no enrolments in the subject for a given year. These zero enrolments will be removed for modelling purposes.
Figure 79: Effects of log transformation for response variable (enrolments) in Specialist Mathematics subject
As multilevel model assumes normality in the error terms, a log transformation is utilised to allow models to be estimated by the linear mixed models. The log transformation allows enrolment numbers to be approximately normally distributed (Figure 79.
| df | AIC | |
|---|---|---|
| Model0.0: Within schools | 3 | 4705.261 |
| Model0.2: Schools nested within districts | 4 | 4705.552 |
| Model0.1: Schools nested within postcodes | 4 | 4707.261 |
As outlined in step 3, the three candidate models are fitted and their AIC is shown in Table 48. Based on the AIC, the two-level model (model0.0) is the superior model and will be used in the subsequent analysis.
## Random effects:
## Groups Name Variance Std.Dev.
## qcaa_school_id (Intercept) 0.62634 0.79142
## Residual 0.47198 0.68701
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.75816 0.06012933 45.87046
##
## Number of schools (level-two group) = 196
## Number of district (level-three group) = NA
This model takes into account 196 schools. For a two-level multilevel model, the level two intraclass correlation coefficient (ICC) can be computed using the model output above. The level-two ICC is the correlation between a school \(i\) in time \(t\) and time \(t^*\):
\[ \text{level-two ICC} = \frac{\tau^{2}_{00}}{\tau^{2}_{00} + \phi^{2}_{00} + \sigma^2} = \frac{0.6263}{(0.6263 + 0.4720)} = 0.5702\]
This can be conceptualised as the correlation between the enrolments of a selected school at two randomly drawn year (i.e. two randomly selected cohort from the same school). In other words, 57.02% of the total variability is attributable to the differences in enrolments within schools at different time periods.
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 1.108130 1.05268
## year10 0.010253 0.10126 -0.720
## Residual 0.306377 0.55351
## Estimate Std. Error t value
## (Intercept) 1.8442381 0.097944037 18.82951
## year10 0.1004847 0.009731646 10.32556
## Number of Level Two groups = 196
## Number of Level Three groups = NA
The next step involves incorporating the linear growth of time into the model. The model output is shown above.
When the subject was first introduced in 2010, schools were expected to have an average of 6.3230 (\(e^{1.8442}\)) enrolments, which is a relatively low number as compared to the other mathematics and science subjects. On average, the enrolments were expected to increase by 10.5724% (\((e^{0.1005} - 1) \times 100\)) per year. The estimated within-school variance decreased by 16.70% (0.4720 to 0.30638), indicating the 35.089% can be explained by the linear growth in time.
| model | npar | AIC | BIC | logLik |
|---|---|---|---|---|
| model4.1 | 14 | 4022.482 | 4101.041 | -1997.241 |
| model4.6 | 12 | 4024.466 | 4091.802 | -2000.233 |
| model4.9 | 13 | 4025.268 | 4098.215 | -1999.634 |
| model4.7 | 13 | 4025.268 | 4098.215 | -1999.634 |
| model4.8 | 13 | 4025.268 | 4098.215 | -1999.634 |
| model4.0 | 16 | 4025.315 | 4115.096 | -1996.657 |
| model4.5 | 12 | 4025.633 | 4092.969 | -2000.816 |
| model4.3 | 10 | 4027.180 | 4083.293 | -2003.590 |
| model4.2 | 11 | 4027.303 | 4089.028 | -2002.651 |
| model4.4 | 11 | 4027.477 | 4089.202 | -2002.739 |
| model4.10 | 11 | 4031.656 | 4093.381 | -2004.828 |
As summarise in step 6, level-two predictors sector and unit will be added to the model. The largest possible model (model4.0) will first be fitted, before iteratively removing fixed effects one at a time (with model4.10 being the smallest of all 10 candidate models), whilst recording the AIC for each model. model4.1 appears to have the optimal (smallest) AIC (Table 49), and will be used in the next section in building the final model.
| npar | AIC | BIC | logLik | deviance | Chisq | Df | Pr_boot(>Chisq) |
|---|---|---|---|---|---|---|---|
| 12 | 4227.105 | 4294.442 | -2101.553 | 4203.105 | NA | NA | NA |
| 14 | 4022.482 | 4101.041 | -1997.241 | 3994.482 | 208.6234 | 2 | 0 |
Figure 80: Histogram of likelihood ratio test statistic, with a red vertical line indicating the likelihood ratio test statistic for the actual model
The parametric bootstrap is used to approximate the likelihood ratio test statistic to produce a more accurate p-value by simulating data under the null hypothesis (detailed explanation can be found in step 7. The p-value indicates the proportion of times in which the bootstrap test statistic is greater than the observed test statistic. Figure 80 displays the likelihood ratio test statistic from the null distribution, with the red line indicates the likelihood ratio test statistic using the actual data.
There is overwhelming statistical evidence (\(\chi^2\) = 208.623 and \(p\)-value = 0 from Table 50) that the larger model (including random slope at level two) is the better model.
| var | 2.5 % | 97.5 % |
|---|---|---|
| sd_(Intercept)|qcaa_school_id | 0.8735519 | 1.1692900 |
| cor_year10.(Intercept)|qcaa_school_id | -0.8548733 | -0.7237709 |
| sd_year10|qcaa_school_id | 0.0831679 | 0.1154846 |
| sigma | 0.5316834 | 0.5674079 |
| (Intercept) | 1.3083256 | 2.5093735 |
| year10 | -0.0246597 | 0.0921339 |
| sectorGovernment | -0.4923102 | 0.7377683 |
| sectorIndependent | -1.5531034 | 0.0525608 |
| unityear_12_enrolments | -0.2032051 | 0.1515589 |
| year10:sectorGovernment | 0.0124112 | 0.1429585 |
| year10:sectorIndependent | -0.0102259 | 0.1449415 |
| year10:unityear_12_enrolments | 0.0026681 | 0.0279735 |
| sectorGovernment:unityear_12_enrolments | -0.3285803 | 0.0027383 |
| sectorIndependent:unityear_12_enrolments | -0.4816025 | -0.0720295 |
The parametric bootstrap is utilised to construct confidence intervals (detailed explanation in step 8) for the random effects. If the confidence intervals between the random effects does not include 0, it provides statistical evidence that the p-value is less than 0.5. In other words, it suggests that the random effects and the correlation between the random effects are significant at the 5% level. The confidence interval for the random effects all exclude 0 (Table 51), indicating that they’re different from 0 in the population (i.e. statistically significant).
Level one (measurement variable) \[Y_{tij} = \pi_{0ij} + \pi_{1ij}year10_{tij} + \epsilon_{tij}\]
Level two (schools within districts) \[\begin{aligned} \pi_{0ij} &= \beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij} \\ \pi_{1ij} &= \beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12}unit_{ij} + u_{1ij} \end{aligned}\]
Therefore, the composite model can be written as
\[\begin{aligned} Y_{tij} =\ & \pi_{0ij} + \pi_{1ij}year10_{tij} + \epsilon_{tij} \\ =\ & (\beta_{00j} + \beta_{01j}sector_{ij} + \beta_{02j}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + u_{0ij}) + \\ & (\beta_{10j} + \beta_{11j}sector_{ij} + \beta_{12}unit_{ij} + u_{1ij})year10_{tij} + \epsilon_{tij} \\ =\ & \left[\beta_{00j} + \beta_{01}sector_{ij} + \beta_{02}unit_{ij} + \beta_{03j}sector_{ij}unit_{ij} + \beta_{10j}year10_{tij} + \beta_{11j}sector_{ij}year_{tij} + \beta_{12}unit_{ij}year10_{tij} \right] \left[u_{0ij} + u_{u_{1ij}} + \epsilon_{tij} \right] \end{aligned}\]
summary(model)
## Groups Name Variance Std.Dev. Corr
## qcaa_school_id (Intercept) 1.0471207 1.023289
## year10 0.0097253 0.098617 -0.802
## Residual 0.3036501 0.551045
## Estimate Std. Error t value
## (Intercept) 1.92746582 0.296641578 6.497625
## year10 0.03386364 0.030295845 1.117765
## sectorGovernment 0.11738215 0.315770024 0.371733
## sectorIndependent -0.78920786 0.393905776 -2.003545
## unityear_12_enrolments -0.02478714 0.091733860 -0.270207
## year10:sectorGovernment 0.07898180 0.032078517 2.462140
## year10:sectorIndependent 0.06729049 0.038678772 1.739727
## year10:unityear_12_enrolments 0.01479030 0.006765064 2.186277
## sectorGovernment:unityear_12_enrolments -0.15761038 0.082352025 -1.913862
## sectorIndependent:unityear_12_enrolments -0.27451131 0.102332610 -2.682540
## Number of Level Two groups = 196
## Number of Level Three groups = NA
Based on the model output, government schools are estimated to have a mean increase of 11.9459% (\((e^{0.0248299 + 0.0197221} - 1) * \times 100\)) per year, which is 8.21846% (\((e^{0.0789818} - 1) * \times 100\)) greater than the increase in enrolments in catholic schools. Independent schools showed a large increase in enrolments (10.6447%) per year, on average. This increase in enrolments is 6.96062% (\((exp^{-0.0221572} - 1) * 100\)) more than that of catholic schools.
Figure 81: Fixed effects of the final model for Science in Practice
The results can be better visualised in Figure 81. Catholic schools are estimated to have smallest growth (as shown by the gentle slope) relative to the other two sectors. By the same token, year 12 enrolments appears to be increasing at a faster rate than that of year 11 units, this may indicate that students may be opting to take the unit in year 10 and re-enrol in year 12.
Figure 82: Random effects for all schools
Figure 82 shows the random effects for all 196 schools that offered the subject. There is a clear negative correlation (-0.80) between the random intercept and the random slope, which indicates that in general, schools with lesser enrolments are generally matched with a larger increase in enrolments over the years.
Figure 83: Model predictions for year 11 enrolments for 20 randomly selected schools
Figure 83 above shows the predictions for 20 randomly selected schools.
This report explores the various trends in enrolments in the old and new QCE system. In general, a large increase in enrolments can be seen in the new QCE system, which is attributable to the 2007 prep year cohort leaving the school system, allowing each year (from 2020 onwards) in the new QCE system to have a full cohort.
A multilevel model was built to predict enrolments for year 11 and year 12. Some models showed interesting findings, such as having a large negative correlation with random slope and intercept; hinting that larger schools are matched with a lower change in enrolments across the years. School districts appear to play a large role in predicting enrolments, and most models showed that there were stark differences in enrolments across the different districts.
This report offers a broad overview of the enrolments in Queensland, Australia. An advantage of using multilevel model is that both intra-school change (i.e. how school enrolments are changing overtime) and interindividual change (i.e. Differences in temporal change across schools) may be recorded. Thus, the analysis will shift to creating a (shiny) web application to zoom in at the more granular (e.g. single school) statistics and information. The interactive features of the application will allow for a more personalised platform for the different needs of users, from extracting information about a single school to comparing enrolments across the different districts. It will include It also aims to incorporate spatial analysis such as the map of Queensland schools to further scrutinise the enrolment trends in the new QCE system.
The programming language used to analyse trends in enrolments for senior secondary (Year 11 and Year 12) mathematics and science subjects in Queensland is R (4.0.2) (R Core Team 2020), and the platform used for running R language is RStudio (1.4.1717) (RStudio Team 2020).
The following packages have been included in our Rmd file.